• Introduction
  • Data
    • Data Dictionary
    • Initial Data Analysis
    • Cleaning and Preparation
    • Feature Creation
    • Exploratory Data Analysis
      • Univariate
      • Multivariate
  • Pre-processing
    • Transform Categorical and Numerical features
  • Classification Models
    • Metrics Definitions
    • Logistic Regression
      • Cross validation and Paramneters
      • Test Set
      • Analysis
    • Support Vector Machine
      • Linear Model
      • Non Linear
      • Analysis
    • Tree Based Models
      • Data Preparation
    • Xtreme Gradient Boosting
      • Training Data
      • Test Set Metrics
      • Feature Importance
    • Gradient Boost
      • Cross Validation Scores
      • Test Set
      • Features
    • AdaBoost
      • Training
      • Test Set
      • Feature Importance
  • Clusters
    • Cluster Review
      • Overview
      • Cluster 1
      • Cluster 2
      • Cluster 3
      • Cluster 4
      • Cluster 5
      • Cluster 6
    • Cluster Analysis
  • Insights
library(rmarkdown)

library(reticulate)

library(knitr)



library(dplyr)

library(ggplot2)

library(tidyr)

library(pander)

library(kableExtra)

library(forcats)

library(skimr)

library(gt)

library(GGally)

library(gmodels)

library(plotly)

library(scales)


library(ggforce)


library(lubridate)

library(DT)

library(visdat)



library(packcircles)

library(viridis)

library(ggthemes)

library(purrr)

library(ggtext)

library(RColorBrewer)

library(ggpubr)
library(grafify)

Introduction



Previous studies have investigated student default on the Federal Direct Student Loan program, primarily focusing on undergraduate populations. However, this project aims to shift the focus to the analysis of graduate student default. This shift is motivated by the observation that the default rate among the graduate population at the college closely aligns with that of the undergraduate population.

Studies on undergraduate defaults have identified dropping out and race as key determinants. Yet, it remains uncertain whether these factors hold the same weight for graduate populations. Consequently, this analysis seeks to evaluate various classification algorithms for their accuracy in predicting students at risk of default.

Furthermore, we plan to employ classification algorithms in conjunction with the Gower clustering technique to extract insights and patterns from the subset of graduate borrowers who default. This approach aims to uncover nuanced factors contributing to default within the graduate student population.



R 4.2.1 and Python 3.9.13 are used for this analysis.



Below are References of articles on student loan default.



Who Are Student Loan Defaulters? (2017, December 14). Center for American Progress. https://www.americanprogress.org/article/student-loan-defaulters/



Author, B. (2021, July 7). Who Is More Likely to Default on Student Loans? Liberty Street Economics. https://libertystreeteconomics.newyorkfed.org/2017/11/who-is-more-likely-to-default-on-student-loans/



Data



Data Dictionary



loan_status Y=default, N=non default
gender M=Male, F=Female
citizen_status Citizen, Eligible non-citizen
marital_status
efc Expected Family Contribution-Index determining amount a student is capable of contributing to educational cost. Calculation includes income, assets, and family size.
School SOE=SChool of Education, SOP=School of Psychology, SOM=School of Management, NEIB=New England Institute of Business
degr_cde
degr_cde_school Student’s degree and school combined.
exit_reason G=Graduated, WD=Official Withdrawal, unk=unofficial withdrawal
Major
local_hrs_attempt Total credit hours attempted
local_hrs_earned Total credit hours earned
yrs_to_pay_dte Elapsed time between start date and payment start date
yrs_to_exit_dte Elapsed time between start date and exit date
undergrad_loans_cc Loans borrowed as undergraduate students at college
grad_loans_cc Loans borrowed as graduate students at college
loans_not_cc Loans borrowed at other colleges previous to start at college
NSLDS_loan_total Total loans borrowed
age



Initial Data Analysis





dflt_grad<-dflt_grad %>% 
  mutate(div_cde=recode(div_cde, 'U2' ="UG", "U3"="UG"))

We’ll remove undergradutes from the dataset.

dflt_grad<-dflt_grad %>% 
  mutate(div_cde=recode(div_cde, 'G2' ="GR", "G3"="GR", "G4"="GR","G5"="GR"))
dflt_grad<-dflt_grad %>% 
  filter(div_cde != "SG")
skim(dflt_grad)
Data summary
Name dflt_grad
Number of rows 13206
Number of columns 32
_______________________
Column type frequency:
character 20
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
loan_status 0 1.00 1 1 0 2 0
pay_start_date 0 1.00 9 9 0 6 0
entry_dte 0 1.00 2 18 0 179 0
entrance_yr 0 1.00 2 4 0 25 0
exit_dte 0 1.00 2 19 0 318 0
exit_reason 0 1.00 1 2 0 10 0
loc_cde 0 1.00 2 5 0 13 0
div_cde 0 1.00 2 2 0 7 0
school 0 1.00 2 3 0 6 0
degr_cde 0 1.00 2 5 0 20 0
cip_desc 68 0.99 2 60 0 32 0
major_1 0 1.00 2 5 0 243 0
major_minor_desc 0 1.00 2 50 0 239 0
value_description 21 1.00 2 41 0 11 0
marital_status 0 1.00 1 9 0 8 0
dob 4 1.00 2 19 0 5478 0
gender 1 1.00 1 2 0 5 0
city 1 1.00 2 18 0 1417 0
state 2 1.00 2 2 0 44 0
citizen_status 0 1.00 1 20 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
career_hrs_attempt 2 1 51.85 33.73 1 32 41.0 62.00 163 ▅▇▁▂▁
career_hrs_earned 2 1 48.17 32.76 0 30 38.0 60.00 132 ▃▇▂▁▂
local_hrs_attempt 2 1 41.54 22.97 1 28 38.0 53.00 163 ▆▇▁▁▁
local_hrs_earned 2 1 37.85 21.89 0 24 36.0 48.00 124 ▅▇▃▁▁
xfer_hrs_earned 2 1 10.32 22.50 0 0 0.0 3.00 90 ▇▁▁▁▁
total_income 66 1 45763.76 141264.28 -5844 19775 33309.0 51419.00 4520667 ▇▁▁▁▁
efc 56 1 7087.36 33753.17 0 0 2435.0 9323.00 999999 ▇▁▁▁▁
life_sub_grad 0 1 5324.29 8656.17 0 0 0.0 8500.00 57289 ▇▂▁▁▁
life_sub_ugrad 0 1 3379.83 5936.45 0 0 0.0 4879.75 23000 ▇▁▁▁▁
life_unsub_grad 0 1 17737.21 16486.77 0 0 15400.0 28615.00 91812 ▇▅▂▁▁
life_unsub_ugrad 0 1 4047.50 7515.94 0 0 0.0 6000.00 48399 ▇▁▁▁▁
nslds_loan_total 8 1 39019.26 27311.63 0 18875 33977.5 53468.75 161460 ▇▆▂▁▁



Cleaning and Preparation

Cleaning and preparation involves removing duplicate rows/columns, dropping or combining categories, renaming variables or categories, removing null entries,creating new variables, and formatting date features.



The date columns are type characters. These columns will be used to create new features and as such will need to be converted to a date format.



pay start date

#convert character column to date
dflt_grad$pay_start_date<-as.Date(as.character(dflt_grad$pay_start_date),format="%m/%d/%Y")

#convert date column to POSIXct format

dflt_grad$pay_start_date<-as.POSIXct(dflt_grad$pay_start_date,format="%Y/%m/%d")



entry data

#convert character column to date
dflt_grad$entry_dte<-as.Date(as.character(dflt_grad$entry_dte),format="%m/%d/%Y")

#convert date column to POSIXct format

dflt_grad$entry_dte<-as.POSIXct(dflt_grad$entry_dte,format="%Y/%m/%d")



Date of Dirth

#convert character column to date
dflt_grad$dob<-as.Date(as.character(dflt_grad$dob),format="%m/%d/%Y")

#convert date column to POSIXct format

dflt_grad$dob<-as.POSIXct(dflt_grad$dob,format="%Y/%m/%d")



exit date

#convert character column to date
dflt_grad$exit_dte<-as.Date(as.character(dflt_grad$exit_dte),format="%m/%d/%Y")

#convert date column to POSIXct format

dflt_grad$exit_dte<-as.POSIXct(dflt_grad$exit_dte,format="%Y/%m/%d")
dflt_grad<-readRDS("dflt_grad.rds")
skim(dflt_grad)
Data summary
Name dflt_grad
Number of rows 9889
Number of columns 32
_______________________
Column type frequency:
character 18
numeric 10
POSIXct 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
loan_status 0 1.00 1 1 0 2 0
entrance_yr 0 1.00 2 4 0 23 0
exit_reason 0 1.00 1 2 0 8 0
loc_cde 0 1.00 2 5 0 12 0
div_cde 0 1.00 2 2 0 6 0
school 0 1.00 2 3 0 5 0
degr_cde 0 1.00 2 5 0 14 0
cip_desc 64 0.99 2 60 0 26 0
major_1 0 1.00 2 5 0 230 0
major_minor_desc 0 1.00 2 50 0 227 0
value_description 11 1.00 2 41 0 11 0
total_income 32 1.00 1 7 0 4616 0
marital_status 0 1.00 1 9 0 8 0
efc 32 1.00 1 6 0 3237 0
gender 1 1.00 1 2 0 5 0
city 1 1.00 2 18 0 1300 0
state 2 1.00 2 2 0 43 0
citizen_status 0 1.00 1 20 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
career_hrs_attempt 2 1 40.74 20.70 1 32 38 49 147 ▃▇▂▁▁
career_hrs_earned 2 1 37.87 19.83 0 28 36 48 132 ▃▇▂▁▁
local_hrs_attempt 2 1 38.89 17.68 1 30 38 48 135 ▃▇▂▁▁
local_hrs_earned 2 1 36.02 16.95 0 27 36 45 120 ▃▇▂▁▁
xfer_hrs_earned 2 1 1.85 8.86 0 0 0 0 90 ▇▁▁▁▁
life_sub_grad 0 1 7042.69 9321.41 0 0 0 13925 57289 ▇▂▁▁▁
life_sub_ugrad 0 1 1428.96 4259.21 0 0 0 0 23000 ▇▁▁▁▁
life_unsub_grad 0 1 23627.13 14879.58 0 12425 21667 32500 91812 ▇▇▃▁▁
life_unsub_ugrad 0 1 1705.01 5279.06 0 0 0 0 41619 ▇▁▁▁▁
nslds_loan_total 8 1 44859.55 27759.78 0 24780 40184 60083 161460 ▇▇▃▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
pay_start_date 0 1 2010-09-30 20:00:00 2018-09-29 20:00:00 2016-09-29 20:00:00 6
entry_dte 2 1 1994-09-25 20:00:00 2023-08-31 20:00:00 2012-09-12 20:00:00 104
exit_dte 2 1 2005-02-21 19:00:00 2023-08-30 20:00:00 2015-06-06 20:00:00 225
dob 18 1 1934-03-08 19:00:00 1994-09-17 20:00:00 1978-05-12 20:00:00 4067



Based on the output from the skim() function after the data transformation we know that there are now 15 categorical features that are of type character. Categorical features must be of type factor for use in classification models, thus these features will be transformed to type factor.



dflt_grad_2<-dflt_grad %>% 
  mutate_if(is.character, factor)



dflt_vis<-vis_dat(dflt_grad_2)



Figure 1

Figure 1





Table 1 displays the data set as having factor, numeric, and POSXct (date) features.



dflt_grad_2<-dflt_grad_2 %>% 
  mutate(exit_reason=recode(exit_reason, 'D' ="WD", "T"="WD", "LV"="WD", "DE"="WD")) 



dflt_grad_2<-dflt_grad_2 %>% 
  mutate(loan_status=recode(loan_status, 'B' ="Y", "D"="N")) 
dflt_grad_2<-dflt_grad_2 %>% 
  mutate(citizen_status=recode(citizen_status, '1' ="citizen", "2"="eligible_non_citizen")) 



dflt_grad_2<-dflt_grad_2 %>% 
  mutate(degr_cde=recode(degr_cde, "POGCT" ="CT", "POBCT"="CT","PRECT" ="CT","EDD"="PHD", "EDL"="PHD", "ME"="MED","DBA" ="PHD")) 



dflt_grad_2$degr_cde<-droplevels(dflt_grad_2$degr_cde)
sum_1<-dflt_grad_2 %>% 
  select(exit_reason, loan_status, div_cde) %>% 
  summary()



Table 1: Summary Factor Variables
exit_reason loan_status div_cde
WD:1377 Y: 575 CA: 831
G :7145 N:9314 CG: 65
NA: 2 NA DB: 19
UK:1365 NA DL: 66
NA NA GR:8906
NA NA NA: 2



Feature Creation






import datetime


import dateutil


from datetime import date



Create age





## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 9889 entries, 0 to 9888
## Data columns (total 32 columns):
##  #   Column              Non-Null Count  Dtype         
## ---  ------              --------------  -----         
##  0   loan_status         9889 non-null   category      
##  1   pay_start_date      9889 non-null   datetime64[ns]
##  2   entry_dte           9887 non-null   datetime64[ns]
##  3   entrance_yr         9889 non-null   category      
##  4   exit_dte            9887 non-null   datetime64[ns]
##  5   exit_reason         9889 non-null   category      
##  6   loc_cde             9889 non-null   category      
##  7   div_cde             9889 non-null   category      
##  8   school              9889 non-null   category      
##  9   degr_cde            9889 non-null   category      
##  10  cip_desc            9889 non-null   category      
##  11  major_1             9889 non-null   category      
##  12  major_minor_desc    9889 non-null   category      
##  13  career_hrs_attempt  9887 non-null   float64       
##  14  career_hrs_earned   9887 non-null   float64       
##  15  local_hrs_attempt   9887 non-null   float64       
##  16  local_hrs_earned    9887 non-null   float64       
##  17  xfer_hrs_earned     9887 non-null   float64       
##  18  value_description   9889 non-null   category      
##  19  total_income        9857 non-null   category      
##  20  marital_status      9889 non-null   category      
##  21  efc                 9857 non-null   category      
##  22  dob                 9871 non-null   datetime64[ns]
##  23  gender              9889 non-null   category      
##  24  life_sub_grad       9889 non-null   float64       
##  25  life_sub_ugrad      9889 non-null   float64       
##  26  life_unsub_grad     9889 non-null   float64       
##  27  life_unsub_ugrad    9889 non-null   float64       
##  28  city                9889 non-null   category      
##  29  state               9889 non-null   category      
##  30  citizen_status      9889 non-null   category      
##  31  nslds_loan_total    9881 non-null   float64       
## dtypes: category(18), datetime64[ns](4), float64(10)
## memory usage: 1.7 MB





dflt_gr_3["age"]=dflt_gr_3["exit_dte"]-dflt_gr_3["dob"]


dflt_gr_3["age"]=dflt_gr_3["age"]/np.timedelta64(52,'W')





dflt_gr_3["age"].head()
## 0    46.931319
## 1    38.596154
## 2    30.184066
## 3    43.615385
## 4    41.332418
## Name: age, dtype: float64




dflt_gr_3['age']=dflt_gr_3['age'].round()




dflt_gr_3['age'].head()
## 0    47.0
## 1    39.0
## 2    30.0
## 3    44.0
## 4    41.0
## Name: age, dtype: float64





Create Exit time frame



dflt_gr_3["yrs_to_exit_dt"]=dflt_gr_3["exit_dte"]-dflt_gr_3["entry_dte"]


dflt_gr_3["yrs_to_exit_dt"]=dflt_gr_3["yrs_to_exit_dt"]/np.timedelta64(52,'W')




dflt_gr_3["yrs_to_exit_dt"].head()
## 0    3.590659
## 1    5.250000
## 2    0.997253
## 3    0.576923
## 4    2.631868
## Name: yrs_to_exit_dt, dtype: float64





dflt_gr_3["yrs_to_exit_dt"]=dflt_gr_3["yrs_to_exit_dt"].round(2)





dflt_gr_3["yrs_to_exit_dt"].head()
## 0    3.59
## 1    5.25
## 2    1.00
## 3    0.58
## 4    2.63
## Name: yrs_to_exit_dt, dtype: float64

Create payment date timeframe



dflt_gr_3["yrs_to_pay_dt"]=dflt_gr_3["pay_start_date"]-dflt_gr_3["entry_dte"]


dflt_gr_3["yrs_to_pay_dt"]=dflt_gr_3["yrs_to_pay_dt"]/np.timedelta64(52,'W')



dflt_gr_3["yrs_to_pay_dt"].head()
## 0    3.304945
## 1    5.337912
## 2    1.299451
## 3    0.664835
## 4    2.304945
## Name: yrs_to_pay_dt, dtype: float64


dflt_gr_3["yrs_to_pay_dt"]=dflt_gr_3["yrs_to_pay_dt"].round(2)


dflt_gr_3["yrs_to_pay_dt"].head()
## 0    3.30
## 1    5.34
## 2    1.30
## 3    0.66
## 4    2.30
## Name: yrs_to_pay_dt, dtype: float64









Undergrad Loans Borrowed at Cambridge College



dflt_grad_4<-dflt_grad_4 %>% 
  mutate(undergrad_loans_cc=life_sub_ugrad+life_unsub_ugrad)



Graduate Loans Borrowed at Cambridge College



dflt_grad_4<-dflt_grad_4 %>% 
  mutate(grad_loans_cc=life_sub_grad+life_unsub_grad)



Total Loans borrowed at Cambridge College



dflt_grad_4<-dflt_grad_4 %>% 
  mutate(total_loans_cc=life_sub_grad+life_unsub_grad+life_sub_ugrad+life_unsub_ugrad)



loans borrowed before CC.

dflt_grad_4<-dflt_grad_4 %>% 
  mutate(loans_not_cc=nslds_loan_total-total_loans_cc)



dflt_grad_4<-dflt_grad_4 %>% 
  mutate(value_description=recode(value_description, 'American Indian or Alaska Native' ="Other", "IPEDs value not calculated"="unknown", "Native Hawaiian or Other Pacific Islander"="Other", "Nonresident Alien"="unknown", "Race and Ethnicity unknown"="unknown", "Hispanics of any race"="Hispanic", "Black or African American"="African American", "Two or more races"="Other"))
dflt_grad_4<-dflt_grad_4 %>% 
  mutate(value_description=replace_na(value_description,"unknown"))
dflt_grad_4<-dflt_grad_4 %>% 
  rename("Race"="value_description")
dflt_grad_4 %>%  select_if(is.numeric) %>% summary()                                                
##  career_hrs_attempt career_hrs_earned local_hrs_attempt local_hrs_earned
##  Min.   :  1.00     Min.   :  0.00    Min.   :  1.00    Min.   :  0.00  
##  1st Qu.: 32.00     1st Qu.: 28.00    1st Qu.: 30.00    1st Qu.: 27.00  
##  Median : 38.00     Median : 36.00    Median : 38.00    Median : 36.00  
##  Mean   : 40.74     Mean   : 37.87    Mean   : 38.88    Mean   : 36.02  
##  3rd Qu.: 49.00     3rd Qu.: 48.00    3rd Qu.: 48.00    3rd Qu.: 45.00  
##  Max.   :147.00     Max.   :132.00    Max.   :135.00    Max.   :120.00  
##  NA's   :2          NA's   :2         NA's   :2         NA's   :2       
##  xfer_hrs_earned  life_sub_grad   life_sub_ugrad  life_unsub_grad
##  Min.   : 0.000   Min.   :    0   Min.   :    0   Min.   :    0  
##  1st Qu.: 0.000   1st Qu.:    0   1st Qu.:    0   1st Qu.:12425  
##  Median : 0.000   Median :    0   Median :    0   Median :21667  
##  Mean   : 1.854   Mean   : 7043   Mean   : 1429   Mean   :23627  
##  3rd Qu.: 0.000   3rd Qu.:13925   3rd Qu.:    0   3rd Qu.:32500  
##  Max.   :90.000   Max.   :57289   Max.   :23000   Max.   :91812  
##  NA's   :2                                                       
##  life_unsub_ugrad nslds_loan_total      age       yrs_to_exit_dt  
##  Min.   :    0    Min.   :     0   Min.   :22.0   Min.   : 0.070  
##  1st Qu.:    0    1st Qu.: 24780   1st Qu.:31.0   1st Qu.: 1.340  
##  Median :    0    Median : 40184   Median :37.0   Median : 1.960  
##  Mean   : 1705    Mean   : 44860   Mean   :38.8   Mean   : 2.427  
##  3rd Qu.:    0    3rd Qu.: 60083   3rd Qu.:46.0   3rd Qu.: 2.890  
##  Max.   :41619    Max.   :161460   Max.   :76.0   Max.   :21.700  
##                   NA's   :8        NA's   :18     NA's   :2       
##  yrs_to_pay_dt    undergrad_loans_cc grad_loans_cc    total_loans_cc  
##  Min.   : 0.070   Min.   :    0      Min.   :     0   Min.   :     0  
##  1st Qu.: 1.270   1st Qu.:    0      1st Qu.: 19278   1st Qu.: 20500  
##  Median : 1.950   Median :    0      Median : 28735   Median : 30500  
##  Mean   : 2.412   Mean   : 3134      Mean   : 30670   Mean   : 33804  
##  3rd Qu.: 2.760   3rd Qu.:    0      3rd Qu.: 39004   3rd Qu.: 42654  
##  Max.   :21.700   Max.   :57500      Max.   :116599   Max.   :138500  
##  NA's   :2                                                            
##   loans_not_cc   
##  Min.   :     0  
##  1st Qu.:     0  
##  Median :  1666  
##  Mean   : 12811  
##  3rd Qu.: 19574  
##  Max.   :138500  
##  NA's   :8



dflt_grad_5<-dflt_grad_4

Transform nslds_loan_total observations of zero to NA

dflt_grad_5<-dflt_grad_5 %>% 
  mutate(nslds_loan_total=na_if(nslds_loan_total,0))

fill NA values of nslds_loan_total with total_loans_cc observation

dflt_grad_5$nslds_loan_total[is.na(dflt_grad_5$nslds_loan_total)]<-dflt_grad_5$total_loans_cc[is.na(dflt_grad_5$nslds_loan_total)]
dflt_grad_5 %>%  select_if(is.numeric) %>% summary() 
##  career_hrs_attempt career_hrs_earned local_hrs_attempt local_hrs_earned
##  Min.   :  1.00     Min.   :  0.00    Min.   :  1.00    Min.   :  0.00  
##  1st Qu.: 32.00     1st Qu.: 28.00    1st Qu.: 30.00    1st Qu.: 27.00  
##  Median : 38.00     Median : 36.00    Median : 38.00    Median : 36.00  
##  Mean   : 40.74     Mean   : 37.87    Mean   : 38.88    Mean   : 36.02  
##  3rd Qu.: 49.00     3rd Qu.: 48.00    3rd Qu.: 48.00    3rd Qu.: 45.00  
##  Max.   :147.00     Max.   :132.00    Max.   :135.00    Max.   :120.00  
##  NA's   :2          NA's   :2         NA's   :2         NA's   :2       
##  xfer_hrs_earned  life_sub_grad   life_sub_ugrad  life_unsub_grad
##  Min.   : 0.000   Min.   :    0   Min.   :    0   Min.   :    0  
##  1st Qu.: 0.000   1st Qu.:    0   1st Qu.:    0   1st Qu.:12425  
##  Median : 0.000   Median :    0   Median :    0   Median :21667  
##  Mean   : 1.854   Mean   : 7043   Mean   : 1429   Mean   :23627  
##  3rd Qu.: 0.000   3rd Qu.:13925   3rd Qu.:    0   3rd Qu.:32500  
##  Max.   :90.000   Max.   :57289   Max.   :23000   Max.   :91812  
##  NA's   :2                                                       
##  life_unsub_ugrad nslds_loan_total      age       yrs_to_exit_dt  
##  Min.   :    0    Min.   :     0   Min.   :22.0   Min.   : 0.070  
##  1st Qu.:    0    1st Qu.: 24950   1st Qu.:31.0   1st Qu.: 1.340  
##  Median :    0    Median : 40375   Median :37.0   Median : 1.960  
##  Mean   : 1705    Mean   : 45159   Mean   :38.8   Mean   : 2.427  
##  3rd Qu.:    0    3rd Qu.: 60078   3rd Qu.:46.0   3rd Qu.: 2.890  
##  Max.   :41619    Max.   :161460   Max.   :76.0   Max.   :21.700  
##                                    NA's   :18     NA's   :2       
##  yrs_to_pay_dt    undergrad_loans_cc grad_loans_cc    total_loans_cc  
##  Min.   : 0.070   Min.   :    0      Min.   :     0   Min.   :     0  
##  1st Qu.: 1.270   1st Qu.:    0      1st Qu.: 19278   1st Qu.: 20500  
##  Median : 1.950   Median :    0      Median : 28735   Median : 30500  
##  Mean   : 2.412   Mean   : 3134      Mean   : 30670   Mean   : 33804  
##  3rd Qu.: 2.760   3rd Qu.:    0      3rd Qu.: 39004   3rd Qu.: 42654  
##  Max.   :21.700   Max.   :57500      Max.   :116599   Max.   :138500  
##  NA's   :2                                                            
##   loans_not_cc   
##  Min.   :     0  
##  1st Qu.:     0  
##  Median :  1666  
##  Mean   : 12811  
##  3rd Qu.: 19574  
##  Max.   :138500  
##  NA's   :8



dflt_grad_5<-dflt_grad_5 %>% 
  mutate(loans_not_cc=replace_na(loans_not_cc,0))
dflt_grad_5 %>% select(loans_not_cc) %>% summary()
##   loans_not_cc   
##  Min.   :     0  
##  1st Qu.:     0  
##  Median :  1636  
##  Mean   : 12801  
##  3rd Qu.: 19574  
##  Max.   :138500



dflt_grad_5 <-dflt_grad_5 %>% 
  filter(nslds_loan_total >0)
dflt_grad_5 %>%  select_if(is.numeric) %>% summary() 
##  career_hrs_attempt career_hrs_earned local_hrs_attempt local_hrs_earned
##  Min.   :  1.00     Min.   :  0.00    Min.   :  1.00    Min.   :  0.00  
##  1st Qu.: 32.00     1st Qu.: 28.00    1st Qu.: 30.00    1st Qu.: 27.00  
##  Median : 38.00     Median : 36.00    Median : 38.00    Median : 36.00  
##  Mean   : 40.74     Mean   : 37.87    Mean   : 38.89    Mean   : 36.02  
##  3rd Qu.: 49.00     3rd Qu.: 48.00    3rd Qu.: 48.00    3rd Qu.: 45.00  
##  Max.   :147.00     Max.   :132.00    Max.   :135.00    Max.   :120.00  
##  NA's   :2          NA's   :2         NA's   :2         NA's   :2       
##  xfer_hrs_earned  life_sub_grad   life_sub_ugrad  life_unsub_grad
##  Min.   : 0.000   Min.   :    0   Min.   :    0   Min.   :    0  
##  1st Qu.: 0.000   1st Qu.:    0   1st Qu.:    0   1st Qu.:12450  
##  Median : 0.000   Median :    0   Median :    0   Median :21667  
##  Mean   : 1.854   Mean   : 7044   Mean   : 1429   Mean   :23632  
##  3rd Qu.: 0.000   3rd Qu.:13925   3rd Qu.:    0   3rd Qu.:32500  
##  Max.   :90.000   Max.   :57289   Max.   :23000   Max.   :91812  
##  NA's   :2                                                       
##  life_unsub_ugrad nslds_loan_total      age       yrs_to_exit_dt  
##  Min.   :    0    Min.   :   361   Min.   :22.0   Min.   : 0.070  
##  1st Qu.:    0    1st Qu.: 24950   1st Qu.:31.0   1st Qu.: 1.340  
##  Median :    0    Median : 40376   Median :37.0   Median : 1.960  
##  Mean   : 1705    Mean   : 45168   Mean   :38.8   Mean   : 2.427  
##  3rd Qu.:    0    3rd Qu.: 60080   3rd Qu.:46.0   3rd Qu.: 2.890  
##  Max.   :41619    Max.   :161460   Max.   :76.0   Max.   :21.700  
##                                    NA's   :18     NA's   :2       
##  yrs_to_pay_dt    undergrad_loans_cc grad_loans_cc    total_loans_cc  
##  Min.   : 0.070   Min.   :    0      Min.   :     0   Min.   :     0  
##  1st Qu.: 1.270   1st Qu.:    0      1st Qu.: 19300   1st Qu.: 20500  
##  Median : 1.950   Median :    0      Median : 28735   Median : 30500  
##  Mean   : 2.412   Mean   : 3135      Mean   : 30676   Mean   : 33811  
##  3rd Qu.: 2.760   3rd Qu.:    0      3rd Qu.: 39006   3rd Qu.: 42654  
##  Max.   :21.700   Max.   :57500      Max.   :116599   Max.   :138500  
##  NA's   :2                                                            
##   loans_not_cc   
##  Min.   :     0  
##  1st Qu.:     0  
##  Median :  1653  
##  Mean   : 12804  
##  3rd Qu.: 19574  
##  Max.   :138500  
## 



dflt_grad_5<-dflt_grad_5 %>% 
  mutate(marital_status=recode(marital_status,"1"="single","2"="married", "3"="separated", "4"="divorced"))





Exploratory Data Analysis



Univariate





gender_cnt<-dflt_grad_5 %>% 
  filter(gender %in% c("F","M")) %>% 
  count(gender)



fig <- plot_ly(gender_cnt, labels = ~gender, values = ~n, type = 'pie')
               
  
  

fig <- fig %>% layout(title = 'Enrollment by Gender',

         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),

         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))



Figure 2





rac_pct<-dflt_grad_5 %>% 
  filter(Race %in% c("White", "African American", "Hispanic","Asian", "Other")) %>% 
  count(Race) %>% 
  mutate(perc=round(n/ sum(n),3))
race_perc_plt<-ggplot(rac_pct, aes(x=fct_reorder(Race, perc), y=perc,color=Race ,group=1, text=paste("Program: ",Race ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=Race, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Enrollment by Race")



Figure 3

Figure 3





major_cnt<-dflt_grad_5 %>% 
  rename(major=major_minor_desc)
top_majors<-major_cnt %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=10)
topMajor_plt<-ggplot(top_majors, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Ten Majors")





Figure 4

Figure 4





loan_stat_pct<-dflt_grad_5 %>% 
  count(loan_status) %>% 
  mutate(perc=round(n/ sum(n),3))



loan_stat_table <- loan_stat_pct %>% 
  select(loan_status, perc) %>% 
  ggtexttable(rows = NULL,  theme = ttheme("mBlue", base_size=14)) %>% tab_add_title("Loan Status Percentage",face="bold", size=8) %>% 
  tab_add_footnote("Table 12", size=9)





grad_loans<-ggplot(dflt_grad_5, aes(x=grad_loans_cc))+
  geom_histogram(fill="green")+
  geom_vline(aes(xintercept=mean(grad_loans_cc)),
            color="red", linetype="dashed", linewidth=1)+
   theme(plot.title = element_text(color="blue", size=10.5, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Distribuition of Graduate Level Loan Totals Borrowed at College")



Figure 5

Figure 5





undergrad_loans<-ggplot(dflt_grad_5, aes(x=undergrad_loans_cc))+
  geom_histogram(fill="blue", bins=10)+
  geom_vline(aes(xintercept=mean(undergrad_loans_cc)),
            color="red", linetype="dashed", linewidth=1)+
   theme(plot.title = element_text(color="blue", size=9.5, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Distribuition of Undergradute Level Loan Totals Borrowed at College")





Figure 6

Figure 6





non_cc_loan<-ggplot(dflt_grad_5, aes(x=loans_not_cc))+
  geom_histogram(fill="yellow", bins=20)+
  geom_vline(aes(xintercept=mean(loans_not_cc)),
            color="red", linetype="dashed", linewidth=1)+
   theme(plot.title = element_text(color="blue", size=9.5, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Distribuition of  Loan Totals Borrowed at Other Colleges")





Figure 7

Figure 7





Multivariate



gender_dflt<-dflt_grad_5 %>% 
  filter(gender %in% c("F","M")) %>% 
  count(gender, loan_status) %>% 
  group_by(gender) %>% 
  mutate(perc = n / sum(n)) %>% 
  ggplot(aes(x=gender, y=perc, fill=loan_status))+
  geom_col(position="dodge")+
  labs(x = "Gender", y = "Percent", fill = " Loan Status")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=.05)+
  theme(axis.ticks.x=element_blank())+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=25, size=10,hjust=.9))+
  theme(axis.text.y=element_text(size=12), axis.title.x=element_blank())+
  ggtitle("Loan Default by Gender")+
  theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.title.y=element_blank(),  axis.text.y = element_blank(),axis.ticks = element_blank())+
  theme(legend.position = "none")
  theme(legend.position = "none")
## List of 1
##  $ legend.position: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE





Figure 8

Figure 8





gender_loan_bx<-dflt_grad_5 %>%
  filter(gender %in% c("F","M")) %>% 
  select(total_loans_cc, gender) %>% 
  ggplot(aes(x=total_loans_cc, y=gender, fill=gender))+
  geom_boxplot()+
  #facet_wrap(~degr_cde)+
   #labs(x = "Years Enrolled", y = "YDegree")+
 # theme(legend.position = "none")
  ggtitle('Total Loans at College by Gender')+
  theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text( size=8))+
   theme(axis.title.y=element_blank(),  axis.title.x=element_blank())+
   theme(axis.text.y=element_text(size=10))





Figure 9

Figure 9



citizen_dflt<-dflt_grad_5 %>% 
  count(citizen_status, loan_status) %>% 
  group_by(citizen_status) %>% 
  mutate(perc = n / sum(n)) %>% 
  ggplot(aes(x=citizen_status, y=perc, fill=loan_status))+
  geom_col(position="dodge")+
  labs(x = "Citizen Status", y = "Percent", fill = " Loan Status")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=.05)+
  theme(axis.ticks.x=element_blank())+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(size=10,hjust=.9))+
  theme(axis.text.y=element_text(size=12), axis.title.x=element_blank())+
  ggtitle("Loan Default by Citizen Status")+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5))+
  theme(axis.title.y=element_blank(),  axis.text.y = element_blank(),axis.ticks = element_blank())+
  theme(legend.position = "none")
  theme(legend.position = "none")
## List of 1
##  $ legend.position: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE



Figure 10

Figure 10





race_loan_bx<-dflt_grad_5 %>%
  filter(Race %in% c("White", "Hispanic", "African American", "Asian", "Other")) %>% 
  #select(total_loans_cc, Race) %>% 
  ggplot(aes(x=Race, y=total_loans_cc,fill=Race))+
  geom_boxplot()+
  facet_wrap(vars(Race))+
   #labs(x = "Years Enrolled", y = "YDegree")+
  theme(legend.position = "none")+
  ggtitle('Total Loans at College by Race')+
  theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank())+
   theme(axis.ticks.x=element_blank(),legend.position = "none")+
  theme(axis.text.x=element_blank())+
   theme(axis.title.y=element_blank())  
    theme(axis.title.x=element_blank())+
   theme(axis.text.y=element_blank())
## List of 2
##  $ axis.title.x: list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.text.y : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE



050000100000050000100000
Total Loans at College by Race RaceOtherAsianAfrican AmericanHispanicWhite

Figure 11





gender_strip<-dflt_grad_5 %>% 
  filter(gender %in% c("F", "M"))

strp_1<-ggplot(gender_strip,aes(x=age, y=gender)) + 
  geom_jitter(position=position_jitter(0.2))+
  ggtitle("Age by Gender")
strp_2<-strp_1 + coord_flip()

strp_3<-strp_2 + stat_summary(fun=median, geom="point", shape=18,
                 size=3, color="red")

strp_4<-strp_3 +scale_color_grey() + theme_classic()



Figure 12

Figure 12

Median Age: Female 38 Male 39



Median Age: Female 38 Male 39

degr_pct<-dflt_grad_5 %>% 
  filter(degr_cde %in% c("MED", "CAGS", "MM", "MBA", "CT","PHD"))%>%
  count(degr_cde) %>% 
  mutate(perc=round(n/ sum(n),3))



deg_plt<-ggplot(degr_pct, aes(x = reorder(degr_cde, -perc), y = perc, fill=degr_cde,group = 1, text = paste("Degree: ", degr_cde, "Precent: ", perc)))+ 
  geom_bar(stat = "identity")+
  labs(x = "Degree", y = "Percent", fill = "Degree")+
  scale_y_continuous(labels=scales::percent_format())+
  #geom_text(aes(label=percent(perc),
                #y=perc + .02),
            #position=position_dodge(0.9),
           # vjust=.10, size=3.0)+
  ggtitle("Degree as Percentage of Enrollment")+
  theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(size=7,hjust=1))+
   theme(axis.title.y=element_blank(),  axis.title.x=element_blank())+
   theme(axis.text.y=element_text(size=8))



MEDCAGSMMMBACTPHD0%20%40%60%80%
Degree as Percentage of Enrollment

Figure 13





school_pct<-dflt_grad_5 %>% 
  filter(school %in% c("SOP", "SOE", "SOM","NIB")) %>% 
  count(school) %>% 
  mutate(perc=round(n/ sum(n),3))
sch_plt<-ggplot(school_pct, aes(x = reorder(school, -perc), y = perc, fill=school,group = 1, text = paste("Scool: ", school, "Precent: ", perc)))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  #geom_text(aes(label=percent(perc),
               # y=perc + .02),
            #position=position_dodge(0.9),
            #vjust=.10, size=3.0)+
  theme(axis.ticks.x=element_blank(),
        axis.text=element_text(size=8),legend.position="none")+
  ggtitle("Percentage of Enrollment by School")+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
   theme(axis.title.y=element_blank(),  axis.title.x=element_blank())+
   theme(axis.text.y=element_text(size=7))



SOESOPSOMNIB0%20%40%
Percentage of Enrollment by School

Figure 14





enroll_lgth_bx<-dflt_grad_5 %>%
  filter(degr_cde %in% c("MED", "CAGS", "MM", "MBA", "CT","PHD")) %>%
  select(yrs_to_exit_dt, degr_cde) %>% 
  ggplot(aes(x=yrs_to_exit_dt, y=degr_cde, fill=degr_cde))+
  geom_boxplot()+
  #facet_wrap(~degr_cde)+
   #labs(x = "Years Enrolled", y = "YDegree")+
 # theme(legend.position = "none")
  ggtitle("Years Enrolled by Degree")+
  theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text( size=7))+
   theme(axis.title.y=element_blank(),  axis.title.x=element_blank())+
   theme(axis.text.y=element_text(size=8))





Figure 15

Figure 15





degr_dflt_perc<-dflt_grad_5 %>% 
  filter(degr_cde %in% c("MED", "CAGS", "MM", "MBA", "CT","PHD"))%>%
count(degr_cde, loan_status) %>% 
   group_by(degr_cde) %>% 
  mutate(perc = n / sum(n))
degr_dflt_perc_y<-degr_dflt_perc %>% 
  spread(loan_status, perc)
degr_dflt_plt<-ggplot(degr_dflt_perc_y, aes(x = reorder(degr_cde, -Y), y = Y, fill=degr_cde))+ 
  geom_bar(stat = "identity")+
  labs(x = "Degree",  fill = "Degree")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(Y),
                y=Y + .01),
            position=position_dodge(0.9),
            vjust=4.0)+
  theme(axis.ticks.x=element_blank(),legend.position = "none") + 
   theme(plot.title = element_text(color="blue", size=12, face="bold.italic", hjust=0.5),axis.title.y=element_blank(),axis.text.y=element_text(size=7))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  ggtitle(" Default Percentage by Degree")





Figure 16

Figure 16







degr_bx_plt<-dflt_grad_5 %>%
  filter(degr_cde %in% c("MED", "CAGS", "MM", "MBA", "CT","PHD"))%>%
  ggplot(aes(x=total_loans_cc, y=degr_cde, fill=degr_cde))+
  geom_boxplot()+
  #facet_wrap(vars(degr_cde))+
  theme(legend.position = "none",
        #axis.title.x=element_blank(),
        #axis.title.y=element_blank(),
        #axis.text.x=element_blank(),
        plot.title=element_text(size=10, hjust=0.5))+
  ggtitle("Total Loans Borrowed at College by Degree")+
  theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text( size=8))+
   theme(axis.title.y=element_blank(),  axis.title.x=element_blank())+
   theme(axis.text.y=element_text(size=8.5))



Figure 17

Figure 17





sch_dflt<-dflt_grad_5 %>% 
  filter(school %in% c("SOE", "SOM", "SOP"))%>%
count(school, loan_status) %>% 
   group_by(school) %>% 
  mutate(perc = n / sum(n))
school_dflt_perc_y<-sch_dflt %>% 
  spread(loan_status, perc)
sch_dflt_plt<-ggplot(school_dflt_perc_y, aes(x = reorder(school, -Y), y = Y, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School",  fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(Y),
                y=Y + .007),
            position=position_dodge(0.9),
            vjust=4.0)+
  theme(axis.ticks.x=element_blank(),legend.position = "none") + 
   theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),axis.title.y=element_blank(),axis.text.y=element_text(size=7),axis.text.x=element_text(size=8))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  ggtitle(" Default Percentage by School")





Figure 18

Figure 18





rce_dflt<-dflt_grad_5 %>% 
  filter(Race %in% c("White", "Hispanic", "African American", "Asian", "Other")) %>% 
count(Race, loan_status) %>% 
   group_by(Race) %>% 
  mutate(perc = round(n / sum(n),3))
rce_dflt_perc_y<-rce_dflt %>% 
  spread(loan_status, perc)
rce_dflt_plt<-ggplot(rce_dflt_perc_y, aes(x = reorder(Race, -Y), y = Y, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race",  fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(Y),
                y=Y + .008),
            position=position_dodge(0.9),
            vjust=4.0)+
  theme(axis.ticks.x=element_blank(),legend.position = "none") + 
   theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),axis.title.y=element_blank(),axis.text.y=element_text(size=7),axis.text.x=element_text(size=8))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  ggtitle(" Default Percentage by Race")





Figure 19

Figure 19





sch_loan_bx<-dflt_grad_5 %>%
   filter(school %in% c("NIB", "SOE", "SOM", "SOP"))%>% 
  ggplot(aes(x=total_loans_cc, y=school, fill=school))+
  geom_boxplot()+
  ggtitle("Total Loans borrowed at Cambridge College by School")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(axis.ticks.x=element_blank(),legend.position = "none", axis.text.x=element_text(angle=70,size=8, hjust=1),axis.text.y=element_text(size=7),
        axis.title.x=element_blank(),
    axis.title.y=element_blank())



Figure 20

Figure 20





sch_deg_plt <- ggplot(sch_deg, aes(y = degr_cde, x = school, color = school)) +
  geom_count(aes(group = degr_cde)) +
  scale_size_area(max_size = 8) +
  labs(title = "Degree Programs and Schools",
       size = "Degree Count Size", Size = "Degree Count") +
  guides(color = "none") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(angle = 70, size = 8, hjust = 1),
    axis.text.y = element_text( size = 8),
    plot.title = element_text(color = "blue", size = 11, face = "bold.italic", hjust = 0.5)
  ) +
  theme(legend.text = element_text(colour = "blue", size = 8, face = "bold")) +
  theme(legend.title = element_text(size = 10, face = "bold")) +
  theme(legend.background = element_rect(fill = "lightblue", linewidth = 0.5, linetype = "solid", colour = "darkblue"))





Figure 21

Figure 21





corr_plot<-dflt_grad_5 %>% 
  select(loan_status,local_hrs_attempt, local_hrs_earned,age, yrs_to_exit_dt, efc, undergrad_loans_cc, grad_loans_cc, loans_not_cc,yrs_to_pay_dt) %>% 
  ggscatmat(color="loan_status", 
            corMethod = "pearson",
            alpha=0.2)





Figure 22

Figure 22







Pre-processing





Missing Values

dflt_grad_6 %>% 
  select(efc, age) %>% 
  summary()
##       efc            age      
##  Min.   :   1   Min.   :22.0  
##  1st Qu.:   2   1st Qu.:31.0  
##  Median : 813   Median :37.0  
##  Mean   :1103   Mean   :38.8  
##  3rd Qu.:2062   3rd Qu.:46.0  
##  Max.   :3237   Max.   :76.0  
##  NA's   :32     NA's   :18



The above output shows missing values for both efc and age features.

We will deal with these missing values using median imputation.

dflt_grad_6<-dflt_grad_6 %>% 
 mutate(age
         = replace(age,
                   is.na(age),
                   median(age, na.rm = TRUE)),
        efc=replace(efc,is.na(efc),
                    median(efc,na.rm=TRUE)))





class_data<-dflt_grad_6 %>% 
  select(loan_status, school,degr_cde, local_hrs_attempt,gender, citizen_status, age,
         yrs_to_pay_dt,Race, marital_status, undergrad_loans_cc,grad_loans_cc,loans_not_cc, efc)

Filter out unknown level

class_data<-class_data %>% 
  filter(Race %in% c("African American", "White", "Hispanic", "Asian","Other"))

Combine degr_cde and school features into one feature.

class_data_2<-class_data %>% 
 mutate(degrCde_school=fct_cross(degr_cde,school, sep="_"))





Before we embark on model building, a crucial preliminary step is pre-processing. This involves splitting our data into training, and test sets, as well as the transformation of numerical and categorical features into formats conducive to classification.







from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, LabelEncoder


from sklearn.compose import ColumnTransformer





from sklearn.pipeline import Pipeline,make_pipeline



# for model evaluation
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay, accuracy_score, roc_auc_score, recall_score,  RocCurveDisplay

#plot_roc_curve,


from sklearn.model_selection import cross_val_score, GridSearchCV,KFold, RandomizedSearchCV, permutation_test_score

from sklearn.model_selection import StratifiedKFold,cross_validate

#from sklearn.model_selection import RepeatedKFold

from numpy import mean

#import numpy as np


from sklearn.model_selection import cross_val_predict


from sklearn.inspection import permutation_importance

from sklearn.compose import make_column_selector as selector


from collections import Counter

import matplotlib.pyplot as plt

import seaborn as sns

import imblearn

from imblearn.over_sampling import SMOTE, SVMSMOTE

from imblearn.pipeline import Pipeline, make_pipeline
## The  Target categories: Index(['Y', 'N'], dtype='object'):



We will separate the data to get predictor features and target features.





X=dflt_cl_data.drop("loan_status", axis=1)

y=dflt_cl_data["loan_status"]
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 9327 entries, 0 to 9326
## Data columns (total 12 columns):
##  #   Column              Non-Null Count  Dtype   
## ---  ------              --------------  -----   
##  0   local_hrs_attempt   9327 non-null   float64 
##  1   gender              9327 non-null   category
##  2   citizen_status      9327 non-null   category
##  3   age                 9327 non-null   int32   
##  4   yrs_to_pay_dt       9327 non-null   float64 
##  5   Race                9327 non-null   category
##  6   marital_status      9327 non-null   category
##  7   undergrad_loans_cc  9327 non-null   float64 
##  8   grad_loans_cc       9327 non-null   float64 
##  9   loans_not_cc        9327 non-null   float64 
##  10  efc                 9327 non-null   int32   
##  11  degrCde_school      9327 non-null   category
## dtypes: category(5), float64(5), int32(2)
## memory usage: 483.0 KB
## None





From the above output we can see that the target feature has been removed.



## <class 'pandas.core.series.Series'>
## RangeIndex: 9327 entries, 0 to 9326
## Series name: loan_status
## Non-Null Count  Dtype   
## --------------  -----   
## 9327 non-null   category
## dtypes: category(1)
## memory usage: 9.2 KB
## None



the y output shows only the loan_status feature.



The data type of the target feature is categorical. Most machine learning algorithms require numerical data types. Label Encoder will be used to transform y to a numeric tpye.





label_encoder=LabelEncoder()



Transform y to data frame from series


y=pd.DataFrame(y)




y["loan_status"]=label_encoder.fit_transform(y["loan_status"])



##    loan_status
## 0            0
## 1            0
## 2            0
## 3            0
## 4            0



## dtype('int64')



From the two above outputs we can see that the target feature has been converted into binary numerical data type. We will convert the data type back to categorical.





## Target Feature categories as binary:  Index([0, 1], dtype='int64'):



## Shape of Predictor Features is (9327, 12):



## Shape of Target Feature is (9327, 1):



## *********** X Structure***********
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 9327 entries, 0 to 9326
## Data columns (total 12 columns):
##  #   Column              Non-Null Count  Dtype   
## ---  ------              --------------  -----   
##  0   local_hrs_attempt   9327 non-null   float64 
##  1   gender              9327 non-null   category
##  2   citizen_status      9327 non-null   category
##  3   age                 9327 non-null   int32   
##  4   yrs_to_pay_dt       9327 non-null   float64 
##  5   Race                9327 non-null   category
##  6   marital_status      9327 non-null   category
##  7   undergrad_loans_cc  9327 non-null   float64 
##  8   grad_loans_cc       9327 non-null   float64 
##  9   loans_not_cc        9327 non-null   float64 
##  10  efc                 9327 non-null   int32   
##  11  degrCde_school      9327 non-null   category
## dtypes: category(5), float64(5), int32(2)
## memory usage: 483.0 KB



The makeup of the X data frame is 9,326 rows and 12 columns while the y data frame’s has 9326 rows and 1 column.





We will now split X,y into Train and Test sets








X_train, X_test,y_train, y_test=train_test_split(X,y, train_size=0.75)



## Shape of X Train (6995, 12):
## Shape of X Test (2332, 12):
## Shape of y Train (6995, 1):
## Shape of y Test (2332, 1):





y features will be transformed to numpy arrays





y_train_np=np.array(y_train)



y_test_np=np.array(y_test)



## Shape of y Train  (6995, 1):
## Shape of y Test  (2332, 1):



Next we will transform y features to one dimensional arrays






y_train_rv=np.ravel(y_train_np)



y_test_rv=np.ravel(y_test_np)



## Shape of y Train rv (6995,):
## Shape of y Test rv (2332,):





From the above output we see that y train and y test have been transformed into one dimensional numpy arrays.





Transform Categorical and Numerical features



Our next step involves transforming the predictor features into formats compatible with machine learning.

Numerical feature transformation is achieved through scaling. This process is crucial to prevent a feature with a wide range, such as in the thousands, from being considered more significant than a feature with a narrower range. Scaling ensures that all features hold equal importance before being applied to a machine learning algorithm. Various methods exist for scaling features, and for this analysis, we will use standard scaling. This technique transforms the data to have a zero mean and a variance of one, rendering the data unitless.

Given that most machine learning algorithms only accept numerical features, categorical features, in their original form, are deemed unacceptable. Therefore, it is necessary to encode these categorical features into numerical values, a process known as categorical encoding. In this analysis, we will employ one-hot encoding. This method represents categorical features as a set of binary features, with each binary feature corresponding to a specific category. The binary feature takes the integer value 1 if the category is present and 0 otherwise



set_config configures pre processing steps such as Standard Scaler and One Hot Encoding to return a Pandas DataFrame





from sklearn import set_config






set_config(transform_output="pandas")



#Define classification columns
categorical = list(X_train.select_dtypes('category').columns)



## Categorical columns are: ['gender', 'citizen_status', 'Race', 'marital_status', 'degrCde_school']







#Define numeric columns
numerical = list(X_train.select_dtypes('number').columns)



## Numerical columns are: ['local_hrs_attempt', 'age', 'yrs_to_pay_dt', 'undergrad_loans_cc', 'grad_loans_cc', 'loans_not_cc', 'efc']





First, we will create transformed train and test sets for the Logistic Regression model. This entails dropping the first category of each feature during One Hot Encoding.






ct_lr=ColumnTransformer(
  transformers=[
   ('scale',StandardScaler(), numerical),
   ('ohe',OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'), categorical)
]
)







## ************First Five Rows X_train_lr************
##       scale__local_hrs_attempt  ...  ohe__degrCde_school_PHD_SOE
## 572                  -0.005663  ...                          0.0
## 3382                  0.767671  ...                          0.0
## 8832                 -0.177515  ...                          0.0
## 1304                  0.051621  ...                          0.0
## 1734                  0.166189  ...                          0.0
## 
## [5 rows x 25 columns]




X_train_lr.describe().T
##                                            count  ...       max
## scale__local_hrs_attempt                  6995.0  ...  5.035331
## scale__age                                6995.0  ...  3.423558
## scale__yrs_to_pay_dt                      6995.0  ...  9.616802
## scale__undergrad_loans_cc                 6995.0  ...  5.684823
## scale__grad_loans_cc                      6995.0  ...  4.982445
## scale__loans_not_cc                       6995.0  ...  6.141727
## scale__efc                                6995.0  ...  1.964071
## ohe__gender_M                             6995.0  ...  1.000000
## ohe__citizen_status_eligible_non_citizen  6995.0  ...  1.000000
## ohe__Race_Asian                           6995.0  ...  1.000000
## ohe__Race_Hispanic                        6995.0  ...  1.000000
## ohe__Race_Other                           6995.0  ...  1.000000
## ohe__Race_White                           6995.0  ...  1.000000
## ohe__marital_status_married               6995.0  ...  1.000000
## ohe__marital_status_separated             6995.0  ...  1.000000
## ohe__marital_status_single                6995.0  ...  1.000000
## ohe__degrCde_school_CAGS_SOP              6995.0  ...  1.000000
## ohe__degrCde_school_CT_SOE                6995.0  ...  1.000000
## ohe__degrCde_school_CT_SOP                6995.0  ...  1.000000
## ohe__degrCde_school_MBA_SOM               6995.0  ...  1.000000
## ohe__degrCde_school_MED_SOE               6995.0  ...  1.000000
## ohe__degrCde_school_MED_SOP               6995.0  ...  1.000000
## ohe__degrCde_school_MM_SOM                6995.0  ...  1.000000
## ohe__degrCde_school_PHD_NIB               6995.0  ...  1.000000
## ohe__degrCde_school_PHD_SOE               6995.0  ...  1.000000
## 
## [25 rows x 8 columns]





X_test_lr=ct_lr.transform(X_test)





## ************First Five Rows X_test_lr************
##       scale__local_hrs_attempt  ...  ohe__degrCde_school_PHD_SOE
## 2851                 -1.609616  ...                          0.0
## 4926                 -0.521219  ...                          0.0
## 8628                  1.197302  ...                          0.0
## 4069                  0.968166  ...                          0.0
## 7763                 -1.323195  ...                          0.0
## 
## [5 rows x 25 columns]



Upon examining the initial five rows of both the training and test sets, it is evident that the features have undergone transformation while concurrently preserving the column feature names.



## Shape of X Train lr (6995, 25):
## Shape of X Test lr (2332, 25):





Next, we transform training and test sets for all other models. During OneHotEncoding, the first category will be dropped only if the feature is binary.





ct_tr=ColumnTransformer(
  transformers=[
   ('num',StandardScaler(), numerical),
   ('cat',OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'), categorical)
]
)





X_train_tr=ct_tr.fit_transform(X_train)



## ************First Five Rows X_train_tr************
##       num__local_hrs_attempt  ...  cat__degrCde_school_PHD_SOE
## 572                -0.005663  ...                          0.0
## 3382                0.767671  ...                          0.0
## 8832               -0.177515  ...                          0.0
## 1304                0.051621  ...                          0.0
## 1734                0.166189  ...                          0.0
## 
## [5 rows x 28 columns]





X_test_tr=ct_tr.transform(X_test)



## ************First Five Rows X_test_tr************
##       num__local_hrs_attempt  ...  cat__degrCde_school_PHD_SOE
## 2851               -1.609616  ...                          0.0
## 4926               -0.521219  ...                          0.0
## 8628                1.197302  ...                          0.0
## 4069                0.968166  ...                          0.0
## 7763               -1.323195  ...                          0.0
## 
## [5 rows x 28 columns]



## Shape of X Train tr (6995, 28):
## Shape of X Test tr (2332, 28):



In comparing the shape outputs of both the training and test sets, we observe three additional columns compared to the logistic regression transformed data



From the Loan Status table in the Exploratory Data Analysis section, we can deduce that only 6% of loan borrowers in the dataset have defaulted. It is evident that this dataset is imbalanced, where the target class exhibits an uneven distribution of observations—specifically, one class label has a considerably higher number of observations, while the other has a significantly lower number.

Classes that constitute a large proportion of the dataset are referred to as majority classes, whereas those making up a smaller proportion are considered minority classes. To address this imbalance, we will implement oversampling on the minority class (those who defaulted). The oversampling process will involve utilizing the SMOTE algorithm, which generates synthetic (resampled) data based on the characteristics of the nearest neighbors

The smote technique is only used on the training data as synthetic data should not be used for applying the model to the test data.





resampler_l=SMOTE(random_state=0)








X_train_sm_lr,y_train_sm_lr=resampler_l.fit_resample(X_train_lr, y_train_rv)



counter_ytrainLR=Counter(y_train_rv)

counter_ytrain_smoteLR=Counter(y_train_sm_lr)



## Original and Resampled Training Target Feature Categories:
## 
## Original Target Feature y Train tr Counter({0: 6608, 1: 387}):
## 
## Resampled Target Feature y Train sm lr Counter({0: 6608, 1: 6608}):





After resampling, the category 1 (Y) now has the same number of rows as category 0 (N)



## Loan Status categories resampled:  (13216, 25):





Classification Models



from sklearn.metrics import  precision_score, f1_score,make_scorer


from sklearn import metrics





For the purpose of evaluating model performance, the event of interest for our analysis is if the loan status is Y (defaulted). This is considered the positive class.

Classification metrics will determine how well our models predict the event of interest.



Metrics Definitions

Accuracy-measures the number of predictions that are correct as a percentage of the total number of predictions that are made. As an example, if 90% of your predictions are correct, your accuracy is simply 90%. Calculation: number of correct predictions/Number of total predictions. TP+TN/(TP+TN+FP+FN)

Precision-tells us about the quality of positive predictions. It may not find all the positives but the ones that the model does classify as positive are very likely to be correct. As an example, out of everyone predicted to have defaulted, how many of them actually did default? So within everything that has been predicted as a positive, precision counts the percentage that is correct. Calculation: True positives/All Positives. TP/(TP+FP)

Recall- tells us about how well the model identifies true positives. The model may find a lot of positives yet it also will wrongly detects many positives that are not actually positives. Out of all the patients who have the disease, how many were correctly identified? So within everything that actually is positive, how many did the model successfully to find. A model with low recall is not able to find all (or a large part) of the positive cases in the data. Calculated as: True Positives/(False Negatives + True Positives)

F1 Score-The F1 score is defined as the harmonic mean of precision and recall.

The harmonic mean is an alternative metric for the more common arithmetic mean. It is often useful when computing an average rate. https://en.wikipedia.org/wiki/Harmonic_mean

The formula for the F1 score is the following: 2 times((precision*Recall)/(Precision + Recall))

Since the F1 score is an average of Precision and Recall, it means that the F1 score gives equal weight to Precision and Recall:





Logistic Regression



Logistic Regression is a binary classification model that finds the probability or odds ratio of an event. Our model has two events or possible outcomes, yes or no. A probability between 0 and 1 of an observation is produced for both events. Example: Probability of an observation being “Y” is .65 whereas the probability of “N” for the same observation is .35.

If you would like to know more about Logistic Regression check out the link below.



Cross validation and Paramneters

We will partition the training set into equal subsets. The subsets are used to assess a model’s performance on training data through cross validation.

The process works by setting aside the first fold as a test set and the remaining subsets are used as the aggregated training set. The model is trained on the aggregated training set then the performance is evaluated on the testing set. This will continue until all folds have been held out as a test set. An evaluation metric is calculated for each iteration then averaged together. This results in a cross validated metric.




logreg=LogisticRegression()



Set grid to find best parameters



lr_params={'C': [0.001, 0.01, 0.1, 1, 10], 
'penalty': ['l2'],
'max_iter': list(range(2000, 3000,10000)),
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

}



Set number of cross validation folds




skf_lr = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)



np.random.seed(0)

lr_search=RandomizedSearchCV(logreg, lr_params, refit=True, 
verbose=3,cv=5,n_iter=25, scoring='roc_auc',return_train_score=True, n_jobs=-1)


lr_search.fit(X_train_sm_lr, y_train_sm_lr)
RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=25, n_jobs=-1,
                   param_distributions={'C': [0.001, 0.01, 0.1, 1, 10],
                                        'max_iter': [2000], 'penalty': ['l2'],
                                        'solver': ['newton-cg', 'lbfgs',
                                                   'liblinear', 'sag',
                                                   'saga']},
                   return_train_score=True, scoring='roc_auc', verbose=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.



Below are the best paramters selectd for the logistic regression model.

LogisticRegression(C=0.1, max_iter=2000, solver='liblinear')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.



Validation Metrics



lr_cv_acc_score=cross_val_score(lr_clf, X_train_sm_lr, y_train_sm_lr, 
scoring='accuracy', cv=skf_lr).mean()



## Average Training Accuracy Score: (0.656)





lr_cv_auc=cross_val_score(lr_clf, X_train_sm_lr, y_train_sm_lr,
scoring='roc_auc', cv=skf_lr).mean()



## Average Training AUC Score: (0.697)



Test Set



y_pred_lr=lr_clf.predict(X_test_lr)

Test Set Metrics



## ****Logistic RegressionValidation Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.97      0.57      0.72      2202
##            1       0.08      0.66      0.15       130
## 
##     accuracy                           0.57      2332
##    macro avg       0.52      0.62      0.43      2332
## weighted avg       0.92      0.57      0.68      2332



## Area Under the Curve Score-Log Reg Test Set: (0.615)






cm_lr_test = metrics.confusion_matrix(y_test_rv, y_pred_lr, labels=[0,1])
df_cm_lr_test = pd.DataFrame(cm_lr_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_lr_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_lr_test.flatten()/np.sum(cm_lr_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(11,8))
sns.heatmap(df_cm_lr_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Logistic Regression", fontsize=14)



Figure 23

Figure 23





Feature Importance





feature_importance_lr=pd.DataFrame({'feature':list(X_test_lr.columns),'feature_importance':[abs(i) for i in lr_clf.coef_[0]]})


feature_importance_lr=feature_importance_lr.sort_values('feature_importance',ascending=False)



For the logistcal regression model we took the absolute value of the coefficients so as to get the Importance of the feature both with negative and positive effect.



Now that we have the importance of the features we will now transform the coefficients for easier interpretation. The coefficients are in log odds format. We will transform them to odds-ratio format.




#Combine feature names and coefficients on top Pandas DataFrame
feature_names_lr=pd.DataFrame(X_test_lr.columns, columns=['Feature'])

log_coef=pd.DataFrame(np.transpose(lr_clf.coef_), columns=['Coefficent'])

coefficients=pd.concat([feature_names_lr, log_coef], axis=1)

#Calculate exponent of the logistic regression coefficients

coefficients['Exp_Coefficient']=np.exp(coefficients['Coefficent'])


#Remove coefficients that are equal to zero.

coefficients=coefficients[coefficients['Exp_Coefficient']>=1]



## ******************Top Five Coefficients******************
##                         Feature  Exp_Coefficient
## 22   ohe__degrCde_school_MM_SOM         2.272679
## 5           scale__loans_not_cc         1.172586
## 21  ohe__degrCde_school_MED_SOP         1.125975
## 20  ohe__degrCde_school_MED_SOE         1.083622
## 2          scale__yrs_to_pay_dt         1.083519





Analysis



The logistic regression model did not perform well based on the cross-validation scores. The accuracy score of 0.66 translates to correctly classifying an observation only 66% of the time. The AUC score of 0.697 places the model in the average category.

Metrics for model performance on the test set (unseen data) were worse. The accuracy score of 0.576 means correctly classifying an observation only 57%, while the AUC score of 0.61 categorizes the performance of this model as below average. Supporting this assessment are the recall and precision scores from the Classification Report. Recall indicates that the model captures approximately six out of ten possible observations for the event of interest (Y). When the event of interest captures less than one out of ten, it is accurate.





Support Vector Machine



Linear Model




from sklearn.svm import SVC


resampler_sv=SVMSMOTE()



X_train_sm_sv,y_train_sm_sv=resampler_sv.fit_resample(X_train_lr, y_train_rv)





## Original and Resampled Training Target Feature Categories:
## 
## Original Target Feature y Train tr Counter({0: 6608, 1: 387}):
## 
## Resampled Target Feature y Train sm lr Counter({0: 6608, 1: 6608}):



Train Model



model_sv=SVC()


param_grid_sv={
  'C': [0.1, 1, 10, 10, 1000],
  'gamma': [1,0.1,0.001, 0.0001],
  'kernel':['rbf','linear', 'poly','sigmoid'],
  'degree':[1,3, 5,7],
  'coef0':[0,1, 3,5]
}



skf_sv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid_sv=RandomizedSearchCV(model_sv,param_grid_sv, cv=5,n_iter=150,refit=True,verbose=3,scoring='roc_auc',return_train_score=True, n_jobs=-1)



Below are the best parameters chosen for the linaer svc model.

SVC(C=10, coef0=1, gamma=0.0001, kernel='linear')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.



Cross Validation Metrics



## Average Training Accuracy SVC  Score: (0.694)



## Average Training Area Under the Curve SVC  Score: (0.751)





Test Set




y_pred_svc=svc_clf.predict(X_test_lr  )



Metrics
## ****Support Vector Machines Test Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.96      0.64      0.76      2202
##            1       0.08      0.54      0.14       130
## 
##     accuracy                           0.63      2332
##    macro avg       0.52      0.59      0.45      2332
## weighted avg       0.91      0.63      0.73      2332





## Average Test Set Area Under the Curve SVC  Score: (0.587)







cm_svc_test = metrics.confusion_matrix(y_test_rv, y_pred_svc, labels=[0,1])
df_cm_svc_test = pd.DataFrame(cm_svc_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_svc_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_svc_test.flatten()/np.sum(cm_svc_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,6))
sns.heatmap(df_cm_svc_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Support Vector Machine ", fontsize=11)

Figure 24

Figure 24



Feature Importance




feature_importance_svc=pd.DataFrame({'feature':list(X_test_lr.columns),'feature_importance':[abs(i) for i in svc_clf.coef_[0]]})


feature_importance_svc=feature_importance_svc.sort_values('feature_importance',ascending=False)

#Combine feature names and coefficients on top Pandas DataFrame
feature_names_svc=pd.DataFrame(X_test_lr.columns, columns=['Feature'])

svc_coef=pd.DataFrame(np.transpose(svc_clf.coef_), columns=['Coefficent'])

coefficients_svc=pd.concat([feature_names_svc, svc_coef], axis=1)



#Calculate exponent of the svc coefficients

coefficients_svc['Exp_Coefficient']=np.exp(coefficients_svc['Coefficent'])

#Remove coefficients that are equal to zero.

coefficients_svc=coefficients_svc[coefficients_svc['Exp_Coefficient']>=1]



## ******************Top Five Coefficients******************
##                        Feature  Exp_Coefficient
## 22  ohe__degrCde_school_MM_SOM         2.167627
## 10          ohe__Race_Hispanic         1.947730
## 1                   scale__age         1.287697
## 5          scale__loans_not_cc         1.217136
## 2         scale__yrs_to_pay_dt         1.157403



Non Linear



model_SVCrbf=SVC(kernel='rbf', probability=True)


param_grid_SVCrbf={
  'C': [0.1, 1, 10, 10, 1000],
  'gamma': [1,0.1,0.001, 0.0001],
  'degree':[1,3, 5,7],
  'coef0':[0,1, 3,5]
}


grid_SVCrbf=RandomizedSearchCV(model_SVCrbf,param_grid_SVCrbf, cv=5,n_iter=15,refit=True,verbose=3,scoring='roc_auc',return_train_score=True, n_jobs=-1)



grid_SVCrbf.fit(X_train_sm_sv,y_train_sm_sv)
RandomizedSearchCV(cv=5, estimator=SVC(probability=True), n_iter=15, n_jobs=-1,
                   param_distributions={'C': [0.1, 1, 10, 10, 1000],
                                        'coef0': [0, 1, 3, 5],
                                        'degree': [1, 3, 5, 7],
                                        'gamma': [1, 0.1, 0.001, 0.0001]},
                   return_train_score=True, scoring='roc_auc', verbose=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Training Set



Below are the best paramters chosen for the non linear svc model

SVC(C=1, coef0=0, degree=1, gamma=1, probability=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.





Cross Validation Metrics



## Average Training Accuracy Score: (0.984)





## Average Training Area Under the Curve  Score: (0.996)



Test set Metrics


y_pred_rbf=rbf_clf.predict(X_test_lr  )



## ****Support Vector Machines (RBF) Test Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.98      0.99      0.98      2202
##            1       0.73      0.61      0.66       130
## 
##     accuracy                           0.97      2332
##    macro avg       0.85      0.80      0.82      2332
## weighted avg       0.96      0.97      0.96      2332





## Average Test Set Area Under the Curve  Score: (0.797)





cm_rbf_test = metrics.confusion_matrix(y_test_rv, y_pred_rbf, labels=[0,1])
df_cm_rbf_test = pd.DataFrame(cm_rbf_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_rbf_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_rbf_test.flatten()/np.sum(cm_rbf_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,6))
sns.heatmap(df_cm_rbf_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Support Vector Machine (RBF)", fontsize=10)



Figure 25

Figure 25

Analysis

The linear SVC model’s training scores show a slight improvement over the logistic regression model’s scores. With an accuracy of 0.69, it classifies observations correctly at almost 70%. The AUC score is slightly above average. However, when we examine the test metrics, the potential promise of this model comes crashing down. The accuracy drops to 0.59, and the test AUC score is even lower at 0.58, making this model just marginally better than flipping a coin. The linear SVC model’s scores could be affected by the non-linearity of the dataset.

Few datasets are perfectly linear. As such, we tried a non-linear SVC model. The non-linear SVC model shows a significant improvement over its linear counterpart in training scores, with accuracy at 0.984 and AUC at 0.99. These scores categorize the non-linear SVC as a very good model. Checking the test metrics, the accuracy is slightly lower at 0.97. The AUC score drops by 0.20, coming in at 0.80. Although the AUC score is significantly lower, we can still classify it as good. Now, let’s assess how well the non-linear model classifies the event of interest (Y). The recall score tells us that the non-linear model captures six out of ten “Y” classes, and precision informs us that seven out of ten observations classified as “Y” are accurate. This represents a definite improvement over our previous models, although it still hovers around average. The high accuracy score is influenced by the model’s superior ability to classify the “N” class.





Tree Based Models



Data Preparation





resampler_tr=SMOTE(random_state=0)


X_train_sm_tr,y_train_sm_tr=resampler_tr.fit_resample(X_train_tr, y_train_rv)





counter_ytrain_tr_sm=Counter(y_train_sm_tr)





## Original and Resampled Training Target Feature Categories:
## 
## Original Target Feature y Train tr Counter({0: 6608, 1: 387}):
## 
## Resampled Target Feature y Train sm tr Counter({0: 6608, 1: 6608}):



Xtreme Gradient Boosting



Training Data





from xgboost import XGBClassifier


params_xg={
    "learning_rate": [0.01, 0.05, 0.10, 0.20,0.25,0.30],
    "max_depth": [3,4,5,6,8,10,12,15],
    "min_child_weight": [1,3,5,7],
    "gamma": [0.0,0.01,0.05,0.1,0.5,1,2,3],
    "colsample_bytree": [0.5,0.6,0.7,0.8,0.9],
    "n_estimators":np.arange(500, 5000, 500),
    'subsample': [0.7,0.8,0.9,1],
    'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
  
  
}






xg_model=XGBClassifier(objective='binary:logistic', n_jobs=-1)



skf_xg = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


xg_cv=RandomizedSearchCV(estimator=xg_model, param_distributions=params_xg, scoring='roc_auc', n_iter=20, cv=5, refit=True,return_train_score=True, n_jobs=-1)


xg_cv.fit(X_train_sm_tr, y_train_sm_tr)
RandomizedSearchCV(cv=5,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           callbacks=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, device=None,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric=None, feature_types=None,
                                           gamma=None, grow_policy=None,
                                           importance_type=None,
                                           interaction_constraints=None,
                                           learning_rate...
                   param_distributions={'colsample_bytree': [0.5, 0.6, 0.7, 0.8,
                                                             0.9],
                                        'gamma': [0.0, 0.01, 0.05, 0.1, 0.5, 1,
                                                  2, 3],
                                        'learning_rate': [0.01, 0.05, 0.1, 0.2,
                                                          0.25, 0.3],
                                        'max_depth': [3, 4, 5, 6, 8, 10, 12,
                                                      15],
                                        'min_child_weight': [1, 3, 5, 7],
                                        'n_estimators': array([ 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]),
                                        'reg_alpha': [0, 0.001, 0.005, 0.01,
                                                      0.05],
                                        'subsample': [0.7, 0.8, 0.9, 1]},
                   return_train_score=True, scoring='roc_auc')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.



Below are the best parameters chosen for the xgboost model.

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.9, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.2, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=15, max_leaves=None,
              min_child_weight=5, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=1000, n_jobs=-1,
              num_parallel_tree=None, random_state=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.





Cross Validation Metrics

## Average Training Accuracy XGBoost  Score: (0.982)



## Average Training Area Under the Curve  Score: (0.997)





Test Set Metrics



y_pred_xg=xg_clf.predict(X_test_tr)
## ****XGBoost Test Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.98      0.99      0.98      2202
##            1       0.74      0.61      0.67       130
## 
##     accuracy                           0.97      2332
##    macro avg       0.86      0.80      0.82      2332
## weighted avg       0.96      0.97      0.96      2332



## Average XGBoostTest Set Area Under the Curve Score: (0.797)







cm_xg_test = metrics.confusion_matrix(y_test_rv, y_pred_xg, labels=[0,1])
df_cm_xg_test = pd.DataFrame(cm_xg_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_xg_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_xg_test.flatten()/np.sum(cm_xg_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,6))
sns.heatmap(df_cm_xg_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-XGBoost", fontsize=14)

Figure 26

Figure 26

Figure 26

Figure 26

Feature Importance



feat_dict_xg= {}
for col, val in sorted(zip(X_train_tr.columns, xg_clf.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict_xg[col]=val
feat_xg_df = pd.DataFrame({'Feature':feat_dict_xg.keys(),'Importance':feat_dict_xg.values()})





values_xg = feat_xg_tp5.Importance    
idx = feat_xg_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values_xg)) else 'red' for x in values_xg ]
sns.barplot(y=idx,x=values_xg,palette=clrs).set(title='Important features XGBoost Model')

#plt.ylabel("Features", fontsize=6)

plt.xlabel("Importance", fontsize=6)

plt.tick_params(axis='x', which='major', labelsize=8)

plt.tick_params(axis='y', labelsize=3,labelrotation=50)



Figure 27

Figure 27







Gradient Boost





from sklearn.ensemble import GradientBoostingClassifier



gb=GradientBoostingClassifier(warm_start=True)




gb_params={
  'subsample':[0.4, 0.6, 0.7, 0.75],
  'n_estimators':np.arange(500, 5000, 500),
  'learning_rate':[0.01,0.05, 0.075,0.1],
  'max_features':range(7,20,2),
  'min_samples_split':range(1000,2100,200),
  'min_samples_leaf':range(30,71,10),
  'max_depth':range(5,16,2),
  
}





search_gb=RandomizedSearchCV(estimator=gb,
param_distributions=gb_params,n_iter=20, scoring='roc_auc', cv=5, verbose=1, refit=True, n_jobs=-1, return_train_score=True,random_state=2)





search_gb.fit(X_train_sm_tr, y_train_sm_tr)
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(warm_start=True),
                   n_iter=20, n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.05, 0.075,
                                                          0.1],
                                        'max_depth': range(5, 16, 2),
                                        'max_features': range(7, 20, 2),
                                        'min_samples_leaf': range(30, 71, 10),
                                        'min_samples_split': range(1000, 2100, 200),
                                        'n_estimators': array([ 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]),
                                        'subsample': [0.4, 0.6, 0.7, 0.75]},
                   random_state=2, return_train_score=True, scoring='roc_auc',
                   verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.



Below are the best parameters chosen for the gradient boost model.



GradientBoostingClassifier(max_depth=15, max_features=9, min_samples_leaf=60,
                           min_samples_split=1000, n_estimators=4000,
                           subsample=0.7, warm_start=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.





Cross Validation Scores



## Average Training Accuracy Gradient Boost  Score: (0.983)



## Average Training Area Under the Curve-GradientBoost  Score: (0.997)





Test Set




y_pred_gb=gb_clf.predict(X_test_tr)
## ****Gradient Boosting Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.98      0.99      0.98      2202
##            1       0.78      0.62      0.69       130
## 
##     accuracy                           0.97      2332
##    macro avg       0.88      0.80      0.84      2332
## weighted avg       0.97      0.97      0.97      2332



## Average Test Set Area Under The Curve- Gradient Boost  Score: (0.802)







cm_gb_test = metrics.confusion_matrix(y_test_rv, y_pred_gb, labels=[0,1])
df_cm_gb_test = pd.DataFrame(cm_gb_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_gb_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_gb_test.flatten()/np.sum(cm_gb_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,6))
sns.heatmap(df_cm_gb_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-Gradient Boost", fontsize=14)



Figure 28

Figure 28





Features



feat_dict_gb = {}
for col, val in sorted(zip(X_train_tr.columns,gb_clf.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict_gb[col]=val

feat_gb_df= pd.DataFrame({'Feature':feat_dict_gb.keys(),'Importance':feat_dict_gb.values()})



values = feat_gb_tp5.Importance    
idx = feat_gb_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features  Gradient Boosting Model')

plt.ylabel("Features", fontsize=6)
plt.xlabel("Importance", fontsize=6)

plt.tick_params(axis='x', which='major', labelsize=8)

plt.tick_params(axis='y', labelsize=4, labelrotation=55)



Figure 29

Figure 29





AdaBoost






from sklearn.ensemble import AdaBoostClassifier

from sklearn.tree import DecisionTreeClassifier



Training






base_tree = DecisionTreeClassifier(max_depth=6)


adb=AdaBoostClassifier(random_state=1,estimator=base_tree)



adb_params={
  'n_estimators':[1000, 3000, 5000, 8000, 10000],
  'learning_rate':(0.0001,.001,.01,0.3, 0.4, 0.5,1)
  
}


adb_grid=RandomizedSearchCV(estimator=adb,
param_distributions=adb_params,n_iter=10, scoring='roc_auc', cv=5, verbose=3, refit=True,return_train_score=True, n_jobs=-1)



adb_grid.fit(X_train_sm_tr, y_train_sm_tr)
RandomizedSearchCV(cv=5,
                   estimator=AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                                                random_state=1),
                   n_jobs=-1,
                   param_distributions={'learning_rate': (0.0001, 0.001, 0.01,
                                                          0.3, 0.4, 0.5, 1),
                                        'n_estimators': [1000, 3000, 5000, 8000,
                                                         10000]},
                   return_train_score=True, scoring='roc_auc', verbose=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.



Below are the best parameters chosen for the ada boost model.

AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                   learning_rate=0.5, n_estimators=1000, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.





Train Cross Validation Metrics



## Average Training Accuracy Score-Ada Boost: (0.983)






print("Average Training Area Under the Curver Score-Ada Boost : (%.3f)" % xg_cv_acc)
## Average Training Area Under the Curver Score-Ada Boost : (0.982)







Test Set




y_pre_adb=adb_clf.predict(X_test_tr)

Test Set Metrics



## ****AdaBoost Validation Classification Report****
##               precision    recall  f1-score   support
## 
##            0       0.98      0.99      0.98      2202
##            1       0.83      0.61      0.70       130
## 
##     accuracy                           0.97      2332
##    macro avg       0.90      0.80      0.84      2332
## weighted avg       0.97      0.97      0.97      2332



## Average Test Set Area Under the Curve Score-Ada Boost : (0.800)





cm_adb_vl = metrics.confusion_matrix(y_test_rv, y_pred_adb, labels=[0,1])
df_cm_adb_vl = pd.DataFrame(cm_adb_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_adb_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_adb_vl.flatten()/np.sum(cm_adb_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)

plt.figure(figsize=(9,9))
sns.heatmap(df_cm_adb_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.title("Confusion Matrix-AdaBoost", fontsize=9)



Figure 30

Figure 30

Feature Importance




feat_dict_adb= {}
for col, val in sorted(zip(X_train_tr.columns,adb_clf.feature_importances_),key=lambda x:x[1],reverse=True):
  feat_dict_adb[col]=val

feat_adb_df = pd.DataFrame({'Feature':feat_dict_adb.keys(),'Importance':feat_dict_adb.values()})





values = feat_adb_tp5.Importance    
idx = feat_adb_tp5.Feature
plt.figure(figsize=(12,10))
clrs = ['green' if (x < max(values)) else 'red' for x in values ]
sns.barplot(y=idx,x=values,palette=clrs).set(title='Important features  AdaBoost Model')

plt.ylabel("Features", fontsize=6)
plt.xlabel("Importance", fontsize=6)

plt.tick_params(axis='x', which='major', labelsize=4)

plt.tick_params(axis='y', labelsize=4, labelrotation=50)



Figure 31

Figure 31



Clusters

library(cluster)

library(Rtsne)

Gower Distance

dflt_grad_6<-readRDS("dflt_grad_6.rds")

Select features

clust_1<-dflt_grad_6 %>% 
  mutate(id=row_number())



clust_2<-clust_1 %>% 
  select(exit_reason,degr_cde, local_hrs_attempt, local_hrs_earned, marital_status, efc, gender, Race,citizen_status, age, yrs_to_pay_dt, yrs_to_exit_dt, undergrad_loans_cc, grad_loans_cc, total_loans_cc, loans_not_cc, id )





gower_dist<-clust_2 %>% 
  select(-id) %>% 
  daisy(metric="gower",
        type=list(logratio=3))



gower_mat<-as.matrix(gower_dist)



sil_width <- c(NA)
for(i in 2:8){  
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)  
  sil_width[i] <- pam_fit$silinfo$avg.width  
}



Figure 31

Figure 31



Based on the elbow plot (Figure 31), we will select six for our cluster generation.



k <- 6
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <-  clust_2 %>%
  mutate(cluster = pam_fit$clustering)



Cluster Dimensions

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering))





Figure 32

Figure 32





gower_clusters<-clust_2 %>% 
  mutate(cluster=pam_fit$clustering)



gower_clusters$cluster<-as.factor(gower_clusters$cluster)



clust_3<-gower_clusters %>% 
  select(id, cluster)



clust_3<-clust_3 %>% 
  left_join(clust_1, on="id")
## Joining, by = "id"



clust_3<-clust_3 %>% 
  rename(major=major_minor_desc)







Cluster Review





gow_clust<-clust_3 %>% 
  count(cluster) %>% 
  mutate(perc = round(n / sum(n),2))



Overview





gow_clust_perc<-ggplot(gow_clust, aes(x = reorder(cluster, -perc), y = perc, fill=cluster))+ 
  geom_bar(stat = "identity")+
  labs(x = "Cluster", y = "Percent", fill = "Cluster")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),
                y=perc + .004),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  theme(axis.ticks.x=element_blank(),axis.title.y=element_blank(),axis.text.y=element_text(size=8),legend.position = "none")







Figure 33

Figure 33





dflt_perc<-clust_3 %>% 
count(cluster, loan_status) %>% 
  group_by(cluster) %>% 
  mutate(perc = n / sum(n))



dflt_perc_y<-dflt_perc %>% 
  spread(loan_status, perc)







Figure 34

Figure 34





clust_gender<-clust_3 %>% 
  count(cluster, gender) %>% 
  group_by(cluster) %>% 
  mutate(perc =round( n / sum(n),3)) %>% 
  ggplot(aes(x=cluster, y=perc, fill=gender))+
  geom_col(position="dodge")+
  labs(x = "Cluster", y = "Percent", fill = "Gender")+
  scale_y_continuous(labels=scales::percent_format())+
  #geom_text(aes(label=percent(perc),
               #y=perc + .03),
            #position=position_dodge(0.9),
           #vjust=.05, size=3)+
  theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=20, hjust=1))+
  theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  theme(legend.title = element_text(size=8, face="bold"))+
  theme(legend.background = element_rect(fill="lightblue",
                                  linewidth=0.5, linetype="solid", 
                                  colour ="darkblue"))+
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  ggtitle(" Gender by Cluster")





Figure 35

Figure 35





clust_cit<-clust_3 %>% 
  count(cluster, citizen_status) %>% 
  group_by(cluster) %>% 
  mutate(perc =round( n / sum(n),3)) %>% 
  ggplot(aes(x=cluster, y=perc, fill=citizen_status))+
  geom_col(position="dodge")+
  labs(x = "Cluster", y = "Percent", fill = "Citizenship Status")+
  scale_y_continuous(labels=scales::percent_format())+
  #geom_text(aes(label=percent(perc),
               #y=perc + .03),
            #position=position_dodge(0.9),
           #vjust=.05, size=3)+
  theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=20, hjust=1))+
  theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  theme(legend.title = element_text(size=8, face="bold"))+
  theme(legend.background = element_rect(fill="lightblue",
                                  linewidth=0.5, linetype="solid", 
                                  colour ="darkblue"))+
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  ggtitle("Citizenship Status  by Cluster")





Figure 36

Figure 36





age_ug_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=age, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Age by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=12, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 37

Figure 37





tot_cc_loan_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=total_loans_cc, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Total Loans Borrowed at Cambridge College by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 38

Figure 38





grad_loan_cc_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=grad_loans_cc, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Graduate Loans Borrowed at Cambridge College by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 39

Figure 39





non_cc_loan_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=loans_not_cc  , fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Total Loans Not Borrowed at Cambridge College by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=12, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 40

Figure 40





total_loan_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=nslds_loan_total  , fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Total Loans Borrowed  by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 41

Figure 41





efc_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=efc, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("EFC by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 42

Figure 42





att_cr_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=local_hrs_attempt, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Attempted Credits by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 43

Figure 43





earned_cr_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=local_hrs_earned, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Earned Credits by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),
    axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 44

Figure 44





exit_cr_box<-clust_3 %>% 
  ggplot(aes(x=cluster, y=yrs_to_pay_dt, fill=cluster))+
  geom_boxplot()+
   #labs(x = "Cluster", y = "Age")+
  ggtitle("Years to Exit by Cluster")+
  theme(legend.position = "none",plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))+
  theme(legend.position = "none")+
    theme(axis.ticks.x=element_blank(),axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.y=element_text(size=7))





Figure 45

Figure 45





Cluster 1

rac_clust_1<-clust_3 %>% 
  filter(Race %in% c("White", "African American", "Hispanic", "Asian", "Other")) %>% 
  filter(cluster==1) %>% 
  count(Race) %>% 
  mutate(perc = round(n / sum(n),2))



rac_clust_1_perc<-ggplot(rac_clust_1, aes(x = reorder(Race, -perc), y = perc, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race", y = "Percent", fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .02),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  theme(axis.text.y=element_text(size=7))





Figure 46

Figure 46





marital_clust_1<-clust_3 %>% 
  filter(cluster==1) %>% 
  count(marital_status) %>% 
  mutate(perc = round(n / sum(n),2))



marital_clust_1_perc<-ggplot(marital_clust_1, aes(x=fct_reorder(marital_status , perc), y=perc,color=marital_status ,group=1, text=paste("Program: ",marital_status ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=marital_status, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster One Marital Status")





Figure 47

Figure 47





exit_clust_1<-clust_3 %>% 
  filter(cluster==1) %>% 
  count(exit_reason) %>% 
  mutate(perc = round(n / sum(n),2))



exit_clust_1_perc<-ggplot(exit_clust_1, aes(x = reorder(exit_reason, -perc), y = perc, fill=exit_reason))+ 
  geom_bar(stat = "identity")+
  labs(x = "Exit Reason", y = "Percent", fill = "Exit Reason")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + 0.01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  theme(axis.text.y=element_text(size=7))





Figure 48

Figure 48





degree_clust_1<-clust_3 %>% 
  filter(cluster==1) %>% 
  count(degr_cde) %>% 
  mutate(perc = round(n / sum(n),2))



degree_clust_1_perc<-ggplot(degree_clust_1, aes(x=fct_reorder(degr_cde, perc), y=perc,color=degr_cde,group=1, text=paste("Program: ",degr_cde,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=degr_cde, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster One Degree Programs")





Figure 49

Figure 49





school_clust_1<-clust_3 %>% 
  filter(cluster==1) %>% 
  count(school) %>% 
  mutate(perc = round(n / sum(n),2))



school_clust_1_perc<-ggplot(school_clust_1, aes(x = reorder(school, -perc), y = perc, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  theme(axis.text.y=element_text(size=7))





Figure 50

Figure 50





major_clust1<-clust_3 %>% 
  filter(cluster==1) %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=5)



major_clust1_plt<-ggplot(major_clust1, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.text.y=element_text(size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Five Majors Cluster 1")





Figure 51

Figure 51





Cluster 2

rac_clust_2<-clust_3 %>% 
   filter(Race %in% c("White", "African American", "Hispanic", "Asian", "Other")) %>% 
  filter(cluster==2) %>% 
  count(Race) %>% 
  mutate(perc = round(n / sum(n),2))



rac_clust_2_perc<-ggplot(rac_clust_2, aes(x = reorder(Race, -perc), y = perc, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race", y = "Percent", fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),color="white",
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  xlab("Race")+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=60,size=7, hjust=1))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
   theme(axis.text.y=element_text(size=7))





Figure 52

Figure 52





marital_clust_2<-clust_3 %>% 
  filter(cluster==2) %>% 
  count(marital_status) %>% 
  mutate(perc = round(n / sum(n),2))



marital_clust_2_perc<-ggplot(marital_clust_2, aes(x=fct_reorder(marital_status , perc), y=perc,color=marital_status ,group=1, text=paste("Program: ",marital_status ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=marital_status, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster Two Marital Status")





Figure 53

Figure 53





exit_clust_2<-clust_3 %>% 
  filter(cluster==2) %>% 
  count(exit_reason) %>% 
  mutate(perc = round(n / sum(n),2))



exit_clust_2_perc<-ggplot(exit_clust_2, aes(x = reorder(exit_reason, -perc), y = perc, fill=exit_reason))+ 
  geom_bar(stat = "identity")+
  labs(x = "Exit Reason", y = "Percent", fill = "Exit Reason")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=7.5,hjust=1))+
  theme(axis.title.x = element_text(color = "#993333", size = 10, face = "bold", vjust=-1))+
  theme(axis.text.y=element_text(size=7))





Figure 54

Figure 54





degree_clust_2<-clust_3 %>% 
  filter(cluster==2) %>% 
  count(degr_cde) %>% 
  mutate(perc = round(n / sum(n),2))



degree_clust_2_perc<-ggplot(degree_clust_2, aes(x=fct_reorder(degr_cde, perc), y=perc,color=degr_cde,group=1, text=paste("Program: ",degr_cde,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=degr_cde, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster One Degree Programs")





saveRDS(degree_clust_2_perc, "degree_clust_2_perc.rds")
Figure 55

Figure 55





school_clust_2<-clust_3 %>% 
  filter(cluster==2) %>% 
  count(school) %>% 
  mutate(perc = round(n / sum(n),2))



school_clust_2_perc<-ggplot(school_clust_2, aes(x = reorder(school, -perc), y = perc, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.title.x = element_text(color = "#993333", size = 9, face = "bold", vjust=-1))+
  theme(axis.text.y=element_text(size=7))





Figure 56

Figure 56





 major_clust2<-clust_3 %>% 
  filter(cluster==2) %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=5)



major_clust2_plt<-ggplot(major_clust2, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.text.y=element_text(size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Five Majors Cluster 2")





Figure 57

Figure 57





Cluster 3

rac_clust_3<-clust_3 %>% 
   filter(Race %in% c("White", "African American", "Hispanic", "Asian", "Other")) %>% 
  filter(cluster==3) %>% 
  count(Race) %>% 
  mutate(perc = round(n / sum(n),2))



rac_clust_3_perc<-ggplot(rac_clust_3, aes(x = reorder(Race, -perc), y = perc, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race", y = "Percent", fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),color="white",
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=8,hjust=1))+
   theme(axis.text.y=element_text(size=7.5))





Figure 58

Figure 58





marital_clust_3<-clust_3 %>% 
  filter(cluster==3) %>% 
  count(marital_status) %>% 
  mutate(perc = round(n / sum(n),2))



marital_clust_3_perc<-ggplot(marital_clust_3, aes(x=fct_reorder(marital_status , perc), y=perc,color=marital_status ,group=1, text=paste("Program: ",marital_status ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=marital_status, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster Three Marital Status")





Figure 59

Figure 59





exit_clust_3<-clust_3 %>% 
  filter(cluster==3) %>% 
  count(exit_reason) %>% 
  mutate(perc = round(n / sum(n),2))



exit_clust_3_perc<-ggplot(exit_clust_3, aes(x = reorder(exit_reason, -perc), y = perc, fill=exit_reason))+ 
  geom_bar(stat = "identity")+
  labs(x = "Exit Reason", y = "Percent", fill = "Exit Reason")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.text.y=element_text(size=7))





Figure 60

Figure 60





degree_clust_3<-clust_3 %>% 
  filter(cluster==3) %>% 
  count(degr_cde) %>% 
  mutate(perc = round(n / sum(n),2))



degree_clust_3_perc<-ggplot(degree_clust_3, aes(x = reorder(degr_cde, -perc), y = perc, fill=degr_cde))+ 
  geom_bar(stat = "identity")+
  labs(x = "Degree", y = "Percent", fill = "Degree")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=9,hjust=1))+
  theme(axis.text.y=element_text(size=8))





Figure 61

Figure 61





school_clust_3<-clust_3 %>% 
  filter(cluster==3) %>% 
  count(school) %>% 
  mutate(perc = round(n / sum(n),2))



school_clust_3_perc<-ggplot(school_clust_3, aes(x = reorder(school, -perc), y = perc, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.text.y=element_text(size=9))





Figure 62

Figure 62





major_clust3<-clust_3 %>% 
  filter(cluster==3) %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=5)



major_clust3_plt<-ggplot(major_clust3, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.text.y=element_text(size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Five Majors Cluster 3")





Figure 63

Figure 63





Cluster 4

rac_clust_4<-clust_3 %>% 
   filter(Race %in% c("White", "African American", "Hispanic", "Asian", "Other")) %>% 
  filter(cluster==4) %>% 
  count(Race) %>% 
  mutate(perc = round(n / sum(n),2))



rac_clust_4_perc<-ggplot(rac_clust_4, aes(x = reorder(Race, -perc), y = perc, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race", y = "Percent", fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),color="white",
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=9,hjust=1))+
   theme(axis.text.y=element_text(size=8))





Figure 64

Figure 64





marital_clust_4<-clust_3 %>% 
  filter(cluster==4) %>% 
  count(marital_status) %>% 
  mutate(perc = round(n / sum(n),2))



marital_clust_4_perc<-ggplot(marital_clust_4, aes(x=fct_reorder(marital_status , perc), y=perc,color=marital_status ,group=1, text=paste("Program: ",marital_status ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=marital_status, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster Four Marital Status")





Figure 65

Figure 65





exit_clust_4<-clust_3 %>% 
  filter(cluster==4) %>% 
  count(exit_reason) %>% 
  mutate(perc = round(n / sum(n),2))



exit_clust_4_perc<-ggplot(exit_clust_4, aes(x = reorder(exit_reason, -perc), y = perc, fill=exit_reason))+ 
  geom_bar(stat = "identity")+
  labs(x = "Exit Reason", y = "Percent", fill = "Exit Reason")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(size=8,hjust=1))+
  theme(axis.text.y=element_text(size=9))





Figure 66

Figure 66





degree_clust_4<-clust_3 %>% 
  filter(cluster==4) %>% 
  count(degr_cde) %>% 
  mutate(perc = round(n / sum(n),2))



degree_clust_4_perc<-ggplot(degree_clust_4, aes(x = reorder(degr_cde, -perc), y = perc, fill=degr_cde))+ 
  geom_bar(stat = "identity")+
  labs(x = "Degree", y = "Percent", fill = "Degree")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc,), 
                color="white",size=3.0,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=9,hjust=1))+
  theme(axis.text.y=element_text(size=8))





Figure 67

Figure 67





school_clust_4<-clust_3 %>% 
  filter(cluster==4) %>% 
  count(school) %>% 
  mutate(perc = round(n / sum(n),2))



school_clust_4_perc<-ggplot(school_clust_4, aes(x = reorder(school, -perc), y = perc, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .01),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=9,hjust=1))+
  theme(axis.text.y=element_text(size=9))





Figure 68

Figure 68



major_clust4<-clust_3 %>% 
  filter(cluster==4) %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=5)



major_clust4_plt<-ggplot(major_clust4, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.text.y=element_text(size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Five Majors Cluster 4")





Figure 69

Figure 69





Cluster 5

rac_clust_5<-clust_3 %>% 
   filter(Race %in% c("White", "African American", "Hispanic", "Asian", "Other")) %>% 
  filter(cluster==5) %>% 
  count(Race) %>% 
  mutate(perc = round(n / sum(n),2))



rac_clust_5_perc<-ggplot(rac_clust_5, aes(x = reorder(Race, -perc), y = perc, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race", y = "Percent", fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),color="white",
                y=perc + .014),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=10,hjust=1))+
   theme(axis.text.y=element_text(size=9))





Figure 70

Figure 70





marital_clust_5<-clust_3 %>% 
  filter(cluster==5) %>% 
  count(marital_status) %>% 
  mutate(perc = round(n / sum(n),2))



marital_clust_5_perc<-ggplot(marital_clust_5, aes(x=fct_reorder(marital_status , perc), y=perc,color=marital_status ,group=1, text=paste("Program: ",marital_status ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=marital_status, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster Four Marital Status")





Figure 71

Figure 71





degree_clust_5<-clust_3 %>% 
  filter(cluster==5) %>% 
  count(degr_cde) %>% 
  mutate(perc = round(n / sum(n),2))



degree_clust_5_perc<-ggplot(degree_clust_5, aes(x = reorder(degr_cde, -perc), y = perc, fill=degr_cde))+ 
  geom_bar(stat = "identity")+
  labs(x = "Degree", y = "Percent", fill = "Degree")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .015),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=10,hjust=1))+
  theme(axis.text.y=element_text(size=9))





Figure 72

Figure 72





school_clust_5<-clust_3 %>% 
  filter(cluster==5) %>% 
  count(school) %>% 
  mutate(perc = round(n / sum(n),2))



school_clust_5_perc<-ggplot(school_clust_5, aes(x = reorder(school, -perc), y = perc, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=2.0,
                y=perc + .017),
            position=position_dodge(0.9),
            vjust=1.0)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=10,hjust=1))+
  theme(axis.text.y=element_text(size=8))





Figure 73

Figure 73





major_clust5<-clust_3 %>% 
  filter(cluster==5) %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=8)



major_clust5_plt<-ggplot(major_clust5, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.text.y=element_text(size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Five Majors Cluster 5")





Figure 74

Figure 74





Cluster 6

rac_clust_6<-clust_3 %>% 
   filter(Race %in% c("White", "African American", "Hispanic", "Asian", "Other")) %>% 
  filter(cluster==6) %>% 
  count(Race) %>% 
  mutate(perc = round(n / sum(n),2))



rac_clust_6_perc<-ggplot(rac_clust_6, aes(x = reorder(Race, -perc), y = perc, fill=Race))+ 
  geom_bar(stat = "identity")+
  labs(x = "Race", y = "Percent", fill = "Race")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc),color="white",
                y=perc + .017),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=10,hjust=1))+
   theme(axis.text.y=element_text(size=12))





Figure 75

Figure 75





marital_clust_6<-clust_3 %>% 
  filter(cluster==6) %>% 
  count(marital_status) %>% 
  mutate(perc = round(n / sum(n),2))



marital_clust_6_perc<-ggplot(marital_clust_6, aes(x=fct_reorder(marital_status , perc), y=perc,color=marital_status ,group=1, text=paste("Program: ",marital_status ,"<br>Percent: ", perc))) +
    geom_segment( aes(xend=marital_status, yend=0)) +
    geom_point( size=4, color="dark green") +
  scale_y_continuous(labels=scales::percent_format())+
    theme(plot.title = element_text(color="blue", size=11, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none",axis.title.x=element_blank(),
    axis.title.y=element_blank())+
    coord_flip() +
    
    #ylab("Median Age")+
  #xlab("Program")
  ggtitle("Cluster Six Marital Status")





Figure 76

Figure 76





degree_clust_6<-clust_3 %>% 
  filter(cluster==6) %>% 
  count(degr_cde) %>% 
  mutate(perc = round(n / sum(n),2))



degree_clust_6_perc<-ggplot(degree_clust_6, aes(x = reorder(degr_cde, -perc), y = perc, fill=degr_cde))+ 
  geom_bar(stat = "identity")+
  labs(x = "Degree", y = "Percent", fill = "Degree")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .017),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank(),axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=10,hjust=1))+
  theme(axis.text.y=element_text(size=12))





Figure 77

Figure 77





school_clust_6<-clust_3 %>% 
  filter(cluster==6) %>% 
  count(school) %>% 
  mutate(perc = round(n / sum(n),2))



school_clust_6_perc<-ggplot(school_clust_6, aes(x = reorder(school, -perc), y = perc, fill=school))+ 
  geom_bar(stat = "identity")+
  labs(x = "School", y = "Percent", fill = "School")+
  scale_y_continuous(labels=scales::percent_format())+
  geom_text(aes(label=percent(perc), color="white",size=3.5,
                y=perc + .011),
            position=position_dodge(0.9),
            vjust=0.5)+
  theme(axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank(),axis.title.y=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=10,hjust=1))+
  theme(axis.text.y=element_text(size=12))





Figure 78

Figure 78





major_clust6<-clust_3 %>% 
  filter(cluster==6) %>% 
  count(major) %>% 
  mutate(perc = round(n / sum(n),2)) %>% 
 slice_max(perc, n=5)



major_clust6_plt<-ggplot(major_clust6, aes(x = reorder(major, -perc), y = perc, fill=major))+ 
  geom_bar(stat = "identity")+
  labs(x = "Major", y = "Count", fill = "Major")+
  #geom_text(aes(label=Count),
           # position=position_dodge(0.9),
           # vjust=1.1)+
   scale_y_continuous(labels=scales::percent_format())+
  theme(plot.title = element_text(color="blue", size=10, face="bold.italic", hjust=0.5),
    axis.ticks.x=element_blank(),legend.position = "none")+
   theme(axis.ticks.x=element_blank())+
  theme(axis.text.x=element_text(angle=70, size=6,hjust=1),axis.text.y=element_text(size=6,hjust=1),axis.title.x=element_blank(),
    axis.title.y=element_blank())+
  ggtitle("Top Five Majors Cluster 6")





Figure 79

Figure 79





Cluster Analysis



From Figure 34, we observe the default percentage of each cluster. Remember, from Table 12, the default percentage for graduate students is 6%. Clusters 3, 5, and 6 are above this percentage, and as such, we’ll focus on these clusters to find what stands out.

Cluster 3 has the highest default percentage at 9.53%. The median total loans borrowed at CC (Figure 38) are the lowest of all clusters, while the median total loans borrowed at other colleges (Figure 40) are the highest at 16,741. The median loans borrowed are striking, as except for clusters 3 and 6, all other clusters have a zero median. The higher median of loans borrowed at other colleges is why Cluster 3’s median total loans borrowed (Figure 41) is the second highest at 46,892. The median EFC is 50, placing this group as high need. Attempted and earned credits (Figures 43 and 44) are low at 15 and 11, respectively. This is possibly due to students in this group exiting sooner in their enrollment, as other clusters have medians of 30 or greater. This is supported by Figure 45, which shows the median exit time is less than a year, and Figure 60, which has no students in this group completing their degree. Another striking feature is the race distribution (Figure 58), which shows that just under three-quarters of Cluster 3 is African American.

Cluster 5 has the second-highest default percentage at 7.33% (Figure 34). The median EFC for this group is, just like Cluster 3, placing this group as high need. We find from Figure 70 that 97% of this group is Hispanic, the majority of which are from Puerto Rico, Lawrence, and Springfield.

Cluster 6’s default percentage is only slightly higher than the population percentage at 6.21% (Figure 34). Its median loans borrowed at CC are 37,484 (Figure 38), only slightly higher than Cluster 4. Median loans borrowed at other colleges (Figure 40) are 12,871. The median total loans borrowed (Figure 41) are 57,499, 10,000 greater than the next closest cluster. The median EFC (Figure 42) is 188, and African Americans make up 98% of this group.





Insights



Our clusters can help gleam more information supporting feature importance. Specifically, we’ll focus on clusters 3,5 and 6 as these clusters have default rates above 6 %.

These clusters all have median efc’s below 200 which places them as especially financially needy populations.

Cluster 3 and 6 have non cc loan medians above 6000, while cluster 6 also has grad loans borrowed at cc median of 56,000.

Delving further, we find cluster 3 and 6 have populations that are majority African American while cluster 5 is majority Hispanic population.

Finally, from cluster 3 we find the entire population withdrew or unofficially withdrew.

Action Plan:

Individual loan entrance counseling plan will be designed for students entering with loan balances from previous colleges and efc’s below 200.Loan entrance counseling for Hispanic students will be provided in Spanish.

Students withdrawing from the college will be contacted immediately and provided information on a third-party loan counseling center for help with repayment options.

Though all students are provided loan exit counseling, those students with total loan debt over 50,000 will be contact individually and provided contact information to a third part loan counseling service.

Let’s check on how our segmentation of this population can focus on details. Firstly, clusters 3,5, and 6 all had median efc’s under two hundered. Secondly, clusters 3 and 6’s poipulation are mostly African American and cluster 5’s poipukation is mostly hispanic. Thirdly, Cluster’s 3 and 6 have high median loans borrowed at other instituitions. Cluster 6 addtionallhy has the highest median of graduate loans borrowed at CC.

Personlized entrance counseling plans will be designed for populations with efc’s under 200 and African-American, and entering with previous loan debt. Hispanic students with efc’s under 200 will have entrance counseling designed in spanish tohelp guid them in borrowering in their first language. Though student’s withdrawing is signficant in cluster 3, we will reach out to all students withdrawing without a dgree and provide them thriud party loan counseling to help them paln for their lkan payments. Student’s borroweing over 50,000 will be encouraged to utilze our thuird party loan counseling service to udnerstrand the best payment options and their rights.